biais.org

Saturday 6 December 2008

OpenCalais: Semantic Analysis Web Service

OpenCalais is a free web service that performs semantic analysis on any English text. It processes the text sent in your request and responds with the extracted concepts and relationships. It's a great tool if you want to play with semantics or add some nice features to your website or blog.

As an example, I sent it the text of this small article about Ruby and Python. Note: for readability, I kept only the interesting data from the response:

<!-- 
Relations: 
ProgrammingLanguage: Python, Ruby
--> 
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:c="http://s.opencalais.com/1/pred/">
  <rdf:Description rdf:about="...">
    <!-- ProgrammingLanguage: Python; --> 
    <c:detection>[similarities and differences between Ruby and ]Python[ but I didn't find any idioms list in Ruby, so if]</c:detection> 
    <c:prefix>similarities and differences between Ruby and</c:prefix> 
    <c:exact>Python</c:exact> 
    <c:suffix>but I didn't find any idioms list in Ruby, so if</c:suffix> 
    <c:relevance>0.543</c:relevance> 
  </rdf:Description>
 
  <rdf:Description rdf:about="...">
    <!-- ProgrammingLanguage: Ruby; --> 
    <c:detection>[ list in Ruby, so if you know one or if you are a ]Ruby[ programmer, please post a]</c:detection> 
    <c:prefix>list in Ruby, so if you know one or if you are a</c:prefix> 
    <c:exact>Ruby</c:exact> 
    <c:suffix>programmer, please post a</c:suffix> 
    <c:relevance>0.386</c:relevance> 
  </rdf:Description>
</rdf:RDF>

The analyzed text is quite small, but the results seem OK: two programming languages detected here, no animal, no gemstone...
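If you want to use such a response in a script, a minimal sketch like the following pulls the detected names and their relevance out of the RDF. It assumes the trimmed layout shown above (c:exact and c:relevance inside the same rdf:Description), which may not match a full OpenCalais response; the function name extract_mentions and the SAMPLE string are just for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces copied from the response shown above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "c": "http://s.opencalais.com/1/pred/",
}

# Trimmed sample, reduced to the two fields used below.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:c="http://s.opencalais.com/1/pred/">
  <rdf:Description rdf:about="...">
    <c:exact>Python</c:exact>
    <c:relevance>0.543</c:relevance>
  </rdf:Description>
  <rdf:Description rdf:about="...">
    <c:exact>Ruby</c:exact>
    <c:relevance>0.386</c:relevance>
  </rdf:Description>
</rdf:RDF>"""


def extract_mentions(rdf_text):
    """Return (exact text, relevance) pairs, most relevant first."""
    root = ET.fromstring(rdf_text)
    mentions = []
    for desc in root.findall("rdf:Description", NS):
        exact = desc.findtext("c:exact", namespaces=NS)
        relevance = desc.findtext("c:relevance", namespaces=NS)
        if exact is None:
            continue  # skip descriptions that are not mentions
        score = float(relevance) if relevance else None
        mentions.append((exact.strip(), score))
    return sorted(mentions, key=lambda m: m[1] or 0.0, reverse=True)


if __name__ == "__main__":
    for name, score in extract_mentions(SAMPLE):
        print(name, score)
```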

Friday 5 December 2008

DBpedia 3.2 Including the DBpedia Ontology

If you like semantics or work on NLP projects, you probably already know DBpedia: a database plus a set of tools that let you run sophisticated queries against Wikipedia. DBpedia 3.2 was released a few days ago, and it now includes the DBpedia Ontology, a manually created cross-domain ontology based on the most commonly used infoboxes within Wikipedia.

Read more on the official announcement
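To give an idea of what a query against DBpedia can look like, here is a small sketch that only builds the request URL for a SPARQL query and does not send it. The endpoint address, the format parameter and the dbo:ProgrammingLanguage class are assumptions taken from the public DBpedia service as I know it today, not from the 3.2 announcement; build_query_url is a made-up helper name.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Assumption: dbo:ProgrammingLanguage is a class of the DBpedia Ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?language ?label WHERE {
  ?language a dbo:ProgrammingLanguage ;
            rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 10
"""


def build_query_url(query, endpoint=ENDPOINT):
    """Build a GET URL asking the endpoint for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Paste the printed URL into a browser or fetch it with urllib.request.
    print(build_query_url(QUERY))
```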

Note: I'm a bit late with this news; I will try to update the blog more often.

Saturday 5 April 2008

Machine Translation Techniques and Open Source

Today there are two main approaches to Machine Translation (MT):

  • Rule-based MT (used by a number of companies in the field: Systran, Reverso, etc.). The only open source software I know of that takes this approach is Apertium.
  • Statistical MT (used by Google and Language Weaver). Moses is an open source implementation of this approach, and its learning process relies on other open source components (for example, giza++ is an open source word aligner that Moses needs to prepare the corpus).

Pros and cons of rule-based machine translation

  • It needs rules, dictionaries (general and contextual), and people with the know-how (linguists) to write these rules and fill the dictionaries.
  • Translation costs (CPU and memory) are fairly low.

Pros and cons of statistical machine translation

  • It needs a big bilingual corpus and computing resources to run the learning process.
  • The bilingual corpus has to be clean (automatic preprocessing and human checking).
  • Translation costs are heavy.
  • You can translate between any pair of languages you want, provided you have the corpus.
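
To make "learning from a bilingual corpus" a bit more concrete, here is a toy sketch of word-translation learning with a few rounds of expectation-maximization, in the spirit of IBM Model 1 word alignment. It is not the code that Moses or giza++ runs, and the three-sentence German/English corpus and the function names are invented for the example.

```python
from collections import defaultdict


def train_word_table(pairs, iterations=20):
    """Estimate t[(target, source)] = P(target word | source word) by EM."""
    target_vocab = {e for _, tgt in pairs for e in tgt}
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for e in tgt:
                # Share each target word among the source words of the sentence.
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # Re-estimate the probabilities from the fractional counts.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t


def best_translation(t, source_word):
    """Return the most probable target word for source_word."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    corpus = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]
    table = train_word_table(corpus)
    for word in ("das", "haus", "buch"):
        print(word, "->", best_translation(table, word))
```

Even on this tiny corpus you can see why the corpus has to be clean: every wrong sentence pair feeds wrong counts into the table.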

Notes: there are other, less used techniques: word-to-word substitution (Linguaphile), example-based translation (I didn't find an open source implementation of this one), and, of course, you can imagine mixed techniques.
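
Word-to-word substitution is simple enough to sketch in a few lines. This is only an illustration of the idea, not how Linguaphile is implemented; the tiny English-to-French dictionary and the function name word_for_word are made up.

```python
def word_for_word(sentence, dictionary):
    """Replace each token with its dictionary entry; keep unknown tokens as-is."""
    translated = []
    for token in sentence.lower().split():
        translated.append(dictionary.get(token, token))
    return " ".join(translated)


if __name__ == "__main__":
    toy_dictionary = {"the": "le", "cat": "chat", "eats": "mange"}
    print(word_for_word("The cat eats", toy_dictionary))
    # Unknown words pass through untouched, and word order is never changed,
    # which is exactly why this technique gives poor translations.
    print(word_for_word("The cat eats fish", toy_dictionary))
```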

Wednesday 31 January 2007

Spelling correction using the Python Natural Language Toolkit (nltk)

Natural Language Toolkit (nltk) is an amazing library for playing with natural language. I read an article about spelling correction and wanted to use nltk to code something useful. The purpose of the following code is to implement a "Did You Mean" feature similar to the one search engines offer. Note: to run the code, you need to install nltk.

How it works:

  • Learning. From a word dictionary or a corpus of sentences, build a dict that maps each word's specialhash to the words sharing that hash and their number of occurrences.
  • Testing. The tested word is hashed, then we look up the words associated with that hash in the learned db and print them sorted by number of occurrences.
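
The two steps above can be sketched without any dependency in Python 3. This is a simplified stand-in, not the code of this post: simple_hash skips the Porter stemming step, and the names simple_hash, learn and suggest are only for illustration.

```python
import string
from collections import Counter, defaultdict


def simple_hash(word):
    """Reduce a word to a rough key (same rules as specialhash, minus stemming)."""
    s = word.lower().replace("z", "s").replace("h", "")
    for ch in string.ascii_lowercase:
        s = s.replace(ch + ch, ch)  # collapse doubled letters
    return s


def learn(words):
    """Learning: map each hash to the words sharing it, with their counts."""
    learned = defaultdict(Counter)
    for word in words:
        learned[simple_hash(word)][word.lower()] += 1
    return learned


def suggest(learned, token):
    """Testing: return the words colliding with token, most frequent first."""
    bucket = learned.get(simple_hash(token))
    if not bucket:
        return []
    return [word for word, _ in bucket.most_common()]


if __name__ == "__main__":
    db = learn(["bird", "bird", "oklahoma", "emphasize"])
    for token in ("birdd", "oklaoma", "emphasise", "carot"):
        print(token, "-", suggest(db, token))
```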

Notes:

  • The Porter class creates a stemmer object whose stem function returns the word stem (removing morphological affixes). For example, stem('operation'), stem('operator') and stem('operating') all return the string 'oper'.
  • We expect correct words and misspelled words to collide in the dictionary, so that looking up a misspelling returns the correct words.
  • brown.raw() returns a generator that iterates over the sentences of the Brown corpus.
from nltk_lite.stem.porter import Porter
from nltk_lite.corpora import brown
 
import sys
from collections import defaultdict
import operator
 
def sortby(nlist ,n, reverse=0):
    nlist.sort(key=operator.itemgetter(n), reverse=reverse)
 
class mydict(dict):
    def __missing__(self, key):
        return 0
 
class DidYouMean:
    def __init__(self):
        self.stemmer = Porter()
 
    def specialhash(self, s):
        s = s.lower()
        s = s.replace("z", "s")
        s = s.replace("h", "")
        for i in [chr(ord("a") + i) for i in range(26)]:
            s = s.replace(i+i, i)
        s = self.stemmer.stem(s)
        return s
 
    def test(self, token):
        hashed = self.specialhash(token)
        if hashed in self.learned:
            words = self.learned[hashed].items()
            sortby(words, 1, reverse=1)
            if token in [i[0] for i in words]:
                return 'This word seems OK'
            else:
                if len(words) == 1:
                    return 'Did you mean "%s" ?' % words[0][0]
                else:
                    return 'Did you mean "%s" ? (or %s)' \
                           % (words[0][0], ", ".join(['"'+i[0]+'"' \
                                                      for i in words[1:]]))
        return "I can't find similar word in my learned db"
 
    def learn(self, listofsentences=[], n=2000):
        self.learned = defaultdict(mydict)
        if listofsentences == []:
            listofsentences = brown.raw()
        for i, sent in enumerate(listofsentences):
            if i >= n: # Limit to the first n sentences of the corpus
                break
            for word in sent:
                self.learned[self.specialhash(word)][word.lower()] += 1
 
def demo():
    d = DidYouMean()
    d.learn()
    # test words chosen to be relevant to the Brown corpus
    for i in "birdd, oklaoma, emphasise, bird, carot".split(", "):
        print i, "-", d.test(i)
 
if __name__ == "__main__":
    demo()

outputs:

birdd - Did you mean "birds" ? (or "bird")
oklaoma - Did you mean "oklahoma" ?
emphasise - Did you mean "emphasize" ? (or "emphasizes", "emphasizing")
bird - This word seems OK
carot - I can't find similar word in my learned db

This is a minimalist "Did you mean" implementation; Google or Yahoo use context and the number of results to give you the best similar words. Also, the specialhash method here is an empirical way to reduce words. Rather than using a dictionary and specialhash, we could use a simple list and a similarity method that computes a percentage of similarity between two words, but the test method would then be less efficient to run.
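
The list-plus-similarity alternative can be sketched with difflib from the Python 3 standard library. This is one possible similarity measure among others, not something the code above uses; the vocabulary and the function name did_you_mean are made up for the example.

```python
import difflib


def did_you_mean(word, vocabulary, cutoff=0.6):
    """Return up to three known words similar to word, best match first."""
    # get_close_matches scores every candidate with SequenceMatcher.ratio(),
    # so each lookup scans the whole vocabulary.
    return difflib.get_close_matches(word.lower(), vocabulary, n=3, cutoff=cutoff)


if __name__ == "__main__":
    vocabulary = ["bird", "birds", "oklahoma", "emphasize"]
    for token in ("birdd", "oklaoma", "carot"):
        print(token, "-", did_you_mean(token, vocabulary))
```

Each lookup compares the word with every entry of the list, whereas the hashed dictionary needs a single lookup; that is the efficiency cost mentioned above.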