When I found myself writing a spell corrector for Etsy's search a couple of years ago, I thought "yawn, solved problem". As it turns out, it's not as solved as this blog post makes it seem. The basics are correct: you need a language model and an error model. The problem is that both of the approaches presented here are pretty naïve. Indeed, the resulting accuracy of 67% is completely unacceptable for a real spell corrector. (An earlier version of this blog post had an evaluation bug that made the accuracy seem much higher: over 90%, if I recall correctly.)
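For reference, the structure both of the critiques below assume is the standard noisy-channel framing: pick the candidate c that maximizes P(c) * P(word | c), where P(c) is the language model and P(word | c) is the error model. A minimal sketch, with both models as hypothetical placeholder functions rather than anything from the post:

def correct(word, candidates, language_model, error_model):
    # Score each candidate by its language-model probability times the
    # error-model probability of mistyping it as 'word'.
    return max(candidates, key=lambda c: language_model(c) * error_model(word, c))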
What's wrong with the language model? Counting occurrences in corpuses of text is not really good enough. There are very rare words that are nevertheless real words, and there are misspellings of common words (even in books) that are far more common than many actual words. You can easily find yourself in a situation where a correctly spelled rare word is corrected to a common misspelling of some other, much more common word.
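A toy illustration of that failure mode; the counts here are invented, but the shape of the problem is real:

from collections import Counter

# Invented corpus counts: "wether" (a castrated ram) is a real but rare
# word, while the misspelling "thier" occurs often enough in raw text to
# look like a legitimate dictionary word.
counts = Counter({'weather': 120000, 'whether': 95000,
                  'wether': 300, 'their': 150000, 'thier': 2000})

def best(candidates):
    # A naive language model: rank purely by corpus frequency.
    return max(candidates, key=lambda w: counts[w])

print(best(['weather', 'whether', 'wether']))  # -> 'weather'; the user's
                                               # correctly typed "wether"
                                               # gets "fixed" away
print('thier' in counts)                       # -> True; the misspelling
                                               # survives as a "word"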
What's wrong with the error model? Edit distance just isn't a very good error model. There are many common misspellings that aren't very close via additions or deletions. There are also one-letter edits that completely change the meaning of something. These should be treated as much bigger errors than modifications that have no effect on meaning.
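A sketch of where a better error model starts: edits carry different probabilities instead of a uniform cost of 1. The cost table below is invented for illustration; in practice these weights are learned from corpora of real spelling errors rather than written by hand:

# Hypothetical per-edit costs; lower means the mistake is more plausible.
EDIT_COST = {
    ('ei', 'ie'): 0.2,  # 'recieve' for 'receive' is an extremely common slip
    ('a', 'e'): 0.5,    # vowel confusions happen constantly
    ('t', 'd'): 0.7,    # phonetically close consonants
    ('s', 'x'): 2.0,    # implausible substitution, penalize heavily
}

def edit_cost(wrong, right):
    # Fall back to a uniform cost of 1.0 for edits we have no data on.
    return EDIT_COST.get((wrong, right), 1.0)

print(edit_cost('ei', 'ie'))  # 0.2: cheap, very plausible
print(edit_cost('s', 'x'))    # 2.0: expensive, unlikely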
For special contexts like Etsy, where words like "jewelry" (also correctly spelled "jewellery" in the UK), "Swarovski", and "ACEO" are extremely common, you really need a custom language model. I just wanted to put this out there lest people be under the misapprehension that spell correction is quite as easy as this blog post makes it out to be.
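To make the custom-model point concrete, here is a sketch assuming a hypothetical file of Etsy listing titles; the key is simply that the counts come from your domain's own text rather than a general book corpus:

import re
from collections import Counter

def words(text):
    return re.findall(r'[a-z]+', text.lower())

# 'listing_titles.txt' is a hypothetical dump of listing titles; counts
# built from it give terms like "swarovski" and "aceo" honest frequencies
# instead of treating them as near-impossible words.
with open('listing_titles.txt') as f:
    domain_counts = Counter(words(f.read()))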
If you are building a production-grade spellchecker (or, for that matter, anything ML), I would suggest spending some time getting familiar with the literature (Google Scholar is your friend). I have done various implementations over time (including, as it happens, a spellchecker, though that one was not meant to be production grade), and I find this to be the most rewarding approach in the long run.
Argh, another blog with no posting dates. Is it that hard to tell people when your post was written? The post starts off with "In the past week..." but we have no idea whether this was written 10 years ago or recently. It may be interesting content, but without that context it's hard to know if it's new or old.
Is there a bug in the train function? Using 'lambda: 1' for the defaultdict along with '+=' means that the first time a feature is encountered, its value is set to 2.
In [1]: from collections import defaultdict
In [2]: d = defaultdict(lambda: 1)
In [3]: d['foo'] += 1
In [4]: d['foo']
Out[4]: 2
I think the code is correct. To implement smoothing, you want to add 1 to the count of every word, regardless of whether it appears in the training data or not.
That is to say, a word that appears once should get a count of 2, and a word that doesn't appear at all should get a count of 1.
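Extending the session above makes that concrete (the word list is just an example):

In [5]: d = defaultdict(lambda: 1)
In [6]: for w in ['the', 'the', 'cat']: d[w] += 1
In [7]: d['the'], d['cat'], d['dog']
Out[7]: (3, 2, 1)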
This is a perfect example of where a (short) code comment would be helpful. The "lambda: 1" is a notable piece of code, but it's hard to tell that at a glance.
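For instance, something like this, reconstructed from the discussion here rather than copied from the post:

from collections import defaultdict

def train(features):
    # defaultdict(lambda: 1) gives every word a baseline count of 1, so an
    # unseen word is never assigned zero probability (add-one smoothing);
    # the += 1 then adds the observed occurrences on top of that baseline.
    model = defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model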
Nonsense; any software dev that can't follow a lambda that adds 1 should be taken outside and shot.
When Norvig talks of regular folks, he means people with 1/10th his IQ, which is still the top 1%. Norvig is so far out there on the IQ scale that I find it funny when some noob says he's found a bug!