Natural Language Toolkit (nltk) is an amazing library to play with natural language. I read an article about spelling correction, and I wanted to use nltk to code something useful. The purpose of the following code is to implement a similar "Did You Mean" feature used by search engines. Note: to run the code, you need to install nltk.

How it works:

  • Learning. From a word dictionary or a corpus of sentences: create a dict containing for each word specialhash, the word and its number of occurrences.
  • Testing. The tested word is hashed, we get the associated words in the learned db and print them sorted by number of occurrences.

Notes:

  • Porter class create a stemmer object with a stem function returning the word stem (removing morphological affixes). For example stem('operation'), stem('operator'), stem('operating') all return the string 'oper'.
  • we expect words and erroneous words collide in the dictionary to return the correct ones.
  • brown.raw() return a generator that iterates over the brown corpus sentences.
from nltk_lite.stem.porter import Porter
from nltk_lite.corpora import brown
 
import sys
from collections import defaultdict
import operator
 
def sortby(nlist ,n, reverse=0):
    nlist.sort(key=operator.itemgetter(n), reverse=reverse)
 
class mydict(dict):
    def __missing__(self, key):
        return 0
 
class DidYouMean:
    def __init__(self):
        self.stemmer = Porter()
 
    def specialhash(self, s):
        s = s.lower()
        s = s.replace("z", "s")
        s = s.replace("h", "")
        for i in [chr(ord("a") + i) for i in range(26)]:
            s = s.replace(i+i, i)
        s = self.stemmer.stem(s)
        return s
 
    def test(self, token):
        hashed = self.specialhash(token)
        if hashed in self.learned:
            words = self.learned[hashed].items()
            sortby(words, 1, reverse=1)
            if token in [i[0] for i in words]:
                return 'This word seems OK'
            else:
                if len(words) == 1:
                    return 'Did you mean "%s" ?' % words[0][0]
                else:
                    return 'Did you mean "%s" ? (or %s)' \
                           % (words[0][0], ", ".join(['"'+i[0]+'"' \
                                                      for i in words[1:]]))
        return "I can't find similar word in my learned db"
 
    def learn(self, listofsentences=[], n=2000):
        self.learned = defaultdict(mydict)
        if listofsentences == []:
            listofsentences = brown.raw()
        for i, sent in enumerate(listofsentences):
            if i >= n: # Limit to the first nth sentences of the corpus
                break
            for word in sent:
                self.learned[self.specialhash(word)][word.lower()] += 1
 
def demo():
    d = DidYouMean()
    d.learn()
    # choice of words to be relevant related to the brown corpus
    for i in "birdd, oklaoma, emphasise, bird, carot".split(", "):
        print i, "-", d.test(i)
 
if __name__ == "__main__":
    demo()

outputs:

birdd - Did you mean "birds" ? (or "bird")
oklaoma - Did you mean "oklahoma" ?
emphasise - Did you mean "emphasize" ? (or "emphasizes", "emphasizing")
bird - This word seems OK
carot - I can't find similar word in my learned db

It's a minimalist "Did you mean" implementation, google or yahoo use context and number of results to give you the best similar words. Also, here the specialhash method is an empirical way to reduce words. Rather than using a dictionary and specialhash, we may use simple list and a similarity method that computes a percentage of similarity between two words, but it would be less efficient to run the test method.