How to Write a Spelling Corrector (2007) (norvig.com)
78 points by hashx on March 28, 2014 | hide | past | favorite | 19 comments


When I found myself writing a spell corrector for Etsy's search a couple of years ago, I thought "yawn, solved problem". As it turns out, it's not as solved as this blog post makes it seem. The basics are correct: you need a language model and an error model. The problem is that both of the approaches presented here are pretty naïve. Indeed, the resulting accuracy of 67% is completely unacceptable for a real spell corrector – an earlier version of this blog post had an evaluation bug that made the accuracy seem much higher – over 90%, if I recall correctly.

What's wrong with the language model? Counting occurrences in corpuses of text is not really good enough. There are very rare words that are nevertheless real words, and there are misspellings of common words (even in books) that are far more common than many actual words. You can easily find yourself in a situation where a correctly spelled rare word is corrected to a common misspelling of some other, much more common word.
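For a concrete illustration of that failure mode (the corpus counts below are purely hypothetical):

  from collections import Counter

  # Hypothetical corpus counts: the misspelling "alot" shows up far
  # more often than plenty of rare-but-real words, and the rare word
  # "aloe" happens to be missing from the corpus entirely.
  COUNTS = Counter({"a": 90000, "lot": 4000, "alot": 900})

  def correct(word, candidates):
      # The naive rule: among candidates, pick the highest raw count.
      return max(candidates, key=lambda w: COUNTS[w])

  # A user who correctly typed "aloe" gets "fixed" into a misspelling,
  # because the count model has never seen "aloe".
  print(correct("aloe", ["aloe", "alot"]))  # -> alot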

What's wrong with the error model? Edit distance just isn't a very good error model. There are many common misspellings that aren't close in terms of single-character insertions, deletions, and substitutions. There are also one-letter edits that completely change the meaning of a word. These should be treated as much bigger errors than modifications that have no effect on meaning.

For special contexts like Etsy, where words like "jewelry" (also correctly spelled "jewellery" in the UK), Swarovski, and "ACEO" are extremely common, you really need a custom language model. I just wanted to put this out there lest people be under the misapprehension that spell correction is quite as easy as this blog post makes it out to be.


http://norvig.com/ngrams/ has a fancier spell-corrector and some more discussion.


If you are building a production-grade spellchecker (or, for that matter, anything ML), I would suggest that you spend some time surveying the literature (Google Scholar is your friend). I have done various implementations over time (including, as it happens, a spellchecker, though that one was not meant to be production grade), and I find this to be the most rewarding approach in the long run.


Yes, reading the literature is often helpful.


Argh, another blog where there are no posting dates. Is it that hard to tell people when your post was written? The post starts off with "In the past week..." but we have no idea whether it was written 10 years ago or recently. It may be interesting content, but without that context it's hard to tell whether it's new or old.



The easiest way to suggest corrections is to keep a list of correct words and compare the input against it with Levenshtein distance[1].
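A minimal sketch of that approach (the word list is just an illustrative stand-in for a real dictionary):

  def levenshtein(a, b):
      # Classic dynamic-programming edit distance: cost 1 for each
      # insertion, deletion, or substitution.
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          cur = [i]
          for j, cb in enumerate(b, 1):
              cur.append(min(prev[j] + 1,                 # deletion
                             cur[j - 1] + 1,              # insertion
                             prev[j - 1] + (ca != cb)))   # substitution
          prev = cur
      return prev[-1]

  def suggest(word, dictionary):
      # Suggest the dictionary word closest to the input.
      return min(dictionary, key=lambda w: levenshtein(word, w))

  print(suggest("speling", ["spelling", "sapling", "splint"]))  # spelling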

To do it really well, you have to look at context (the previous words) and add learning behavior. This is done nicely in the SwiftKey keyboard for Android.
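The context part can be as simple as conditioning on the previous word; a toy sketch with hypothetical bigram counts:

  from collections import Counter

  # Hypothetical bigram counts standing in for a learned model.
  BIGRAMS = Counter({("peanut", "butter"): 500, ("peanut", "batter"): 2})

  def pick(prev_word, candidates):
      # Prefer the candidate that most often follows the previous word.
      return max(candidates, key=lambda w: BIGRAMS[(prev_word, w)])

  print(pick("peanut", ["butter", "batter"]))  # butter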

The next step is looking at characters close to the mistyped character in the keyboard layout. This is also done in SwiftKey.
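One way to express that is a substitution cost that depends on key adjacency. The map below is a hypothetical QWERTY fragment; the cost function could be dropped into the Levenshtein code above in place of the flat (ca != cb) cost:

  # Hypothetical fragment of a QWERTY adjacency map; a real one would
  # cover the whole layout.
  ADJACENT = {
      "q": "wa", "w": "qes", "e": "wrd", "r": "etf",
      "a": "qswz", "s": "awedxz", "d": "serfcx",
  }

  def sub_cost(c1, c2):
      # Substituting a neighboring key is a more plausible slip of the
      # finger, so it costs less than a distant key.
      if c1 == c2:
          return 0.0
      return 0.5 if c2 in ADJACENT.get(c1, "") else 1.0

  print(sub_cost("w", "e"))  # 0.5 -- adjacent keys
  print(sub_cost("w", "p"))  # 1.0 -- far apart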

[1] http://en.wikipedia.org/wiki/Levenshtein_distance


I remember the spell-checker being harder than it looks. You have to train a Naive Bayes classifier to "recognize" misspelled words.
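More precisely, the article's own framing is a Bayes-rule argmax over candidate corrections, P(c) * P(w|c); here is a toy sketch of that selection, with placeholder counts and a stand-in error model:

  from collections import Counter

  # Toy word counts standing in for a language model trained on text.
  WORD_COUNTS = Counter({"the": 80000, "they": 12000, "then": 9000})

  def error_prob(typed, candidate):
      # Placeholder error model P(typed | candidate); a real one would
      # be learned from actual misspelling data.
      return 1.0 / (1 + abs(len(typed) - len(candidate)))

  def best_correction(typed, candidates):
      total = sum(WORD_COUNTS.values())
      # argmax over P(candidate) * P(typed | candidate)
      return max(candidates,
                 key=lambda c: (WORD_COUNTS[c] / total) * error_prob(typed, c))

  print(best_correction("thej", ["the", "they", "then"]))  # the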

I can't access the link right now... OK: http://web.archive.org/web/20140217174744/http://norvig.com/...


Is there a bug in the train function? Using 'lambda: 1' for the defaultdict along with '+=' means that the first time a feature is encountered, the value is set to 2.

  In [1]: from collections import defaultdict

  In [2]: d = defaultdict(lambda: 1)

  In [3]: d['foo'] += 1

  In [4]: d['foo']
  Out[4]: 2


I think the code is correct. To implement smoothing, you want to add 1 to the count of every word, regardless of whether it appears in the training data or not.

That is to say, a word that appears once should get a count of 2, and a word that doesn't appear at all should get a count of 1.
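Reconstructed from the snippet under discussion (not copied verbatim from the article), the pattern and its effect:

  from collections import defaultdict

  def train(features):
      # Every word's count starts at 1 (the defaultdict's lambda), so
      # unseen words keep some probability mass -- add-one smoothing.
      model = defaultdict(lambda: 1)
      for f in features:
          model[f] += 1
      return model

  model = train(["spam", "spam", "eggs"])
  print(model["spam"])   # 3: seen twice, plus the initial 1
  print(model["eggs"])   # 2: seen once, plus the initial 1
  print(model["new"])    # 1: never seen, still gets a count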


That makes sense, thanks!

I should learn to read...


This is a perfect example of where a (short) code comment would be helpful. The "lambda: 1" is a notable piece of code, but it's hard to tell that at a glance.


no it's not; the code is perfectly clear as written. Smoothing like this is common practice in this area of software engineering.


And yet this is written for people who aren't familiar with this area of software engineering. It says so right at the top of the article.


nonsense, any software dev that can't follow a lambda that adds 1 should be taken outside and shot.

When Norvig talks of regular folks, he means people with 1/10th his IQ, which is still the top 1%. Norvig is so far out there on the IQ scale that I find it funny when some noob says he's found a bug!



awk needed just 15 lines, whereas Java needed 372!


Well... to be fair, there's also a 35-line Java version there.


awk is for messing with text; of course it's going to be terser.



