Wednesday 31 January 2007
Spelling correction using the Python Natural Language Toolkit (nltk)
By Maxime Biais, Wednesday 31 January 2007 at 12:41 :: NLP
Natural Language Toolkit (nltk) is an amazing library to play with natural language. I read an article about spelling correction, and I wanted to use nltk to code something useful. The purpose of the following code is to implement a similar "Did You Mean" feature used by search engines. Note: to run the code, you need to install nltk.
How it works:
- Learning. From a word dictionary or a corpus of sentences: create a dict containing for each word specialhash, the word and its number of occurrences.
- Testing. The tested word is hashed, we get the associated words in the learned db and print them sorted by number of occurrences.
Notes:
Porterclass create a stemmer object with astemfunction returning the word stem (removing morphological affixes). For examplestem('operation'),stem('operator'),stem('operating')all return the string'oper'.- we expect words and erroneous words collide in the dictionary to return the correct ones.
brown.raw()return a generator that iterates over the brown corpus sentences.
from nltk_lite.stem.porter import Porter from nltk_lite.corpora import brown import sys from collections import defaultdict import operator def sortby(nlist ,n, reverse=0): nlist.sort(key=operator.itemgetter(n), reverse=reverse) class mydict(dict): def __missing__(self, key): return 0 class DidYouMean: def __init__(self): self.stemmer = Porter() def specialhash(self, s): s = s.lower() s = s.replace("z", "s") s = s.replace("h", "") for i in [chr(ord("a") + i) for i in range(26)]: s = s.replace(i+i, i) s = self.stemmer.stem(s) return s def test(self, token): hashed = self.specialhash(token) if hashed in self.learned: words = self.learned[hashed].items() sortby(words, 1, reverse=1) if token in [i[0] for i in words]: return 'This word seems OK' else: if len(words) == 1: return 'Did you mean "%s" ?' % words[0][0] else: return 'Did you mean "%s" ? (or %s)' \ % (words[0][0], ", ".join(['"'+i[0]+'"' \ for i in words[1:]])) return "I can't find similar word in my learned db" def learn(self, listofsentences=[], n=2000): self.learned = defaultdict(mydict) if listofsentences == []: listofsentences = brown.raw() for i, sent in enumerate(listofsentences): if i >= n: # Limit to the first nth sentences of the corpus break for word in sent: self.learned[self.specialhash(word)][word.lower()] += 1 def demo(): d = DidYouMean() d.learn() # choice of words to be relevant related to the brown corpus for i in "birdd, oklaoma, emphasise, bird, carot".split(", "): print i, "-", d.test(i) if __name__ == "__main__": demo()
outputs:
birdd - Did you mean "birds" ? (or "bird") oklaoma - Did you mean "oklahoma" ? emphasise - Did you mean "emphasize" ? (or "emphasizes", "emphasizing") bird - This word seems OK carot - I can't find similar word in my learned db
It's a minimalist "Did you mean" implementation, google or yahoo use context and number of results to give you the best similar words. Also, here the specialhash method is an empirical way to reduce words. Rather than using a dictionary and specialhash, we may use simple list and a similarity method that computes a percentage of similarity between two words, but it would be less efficient to run the test method.



