When I found myself writing a spell corrector for Etsy's search a couple of years ago, I thought "yawn, solved problem". As it turns out, it's not as solved as this blog post makes it seem. The basics are correct: you need a language model and an error model. The problem is that both of the approaches presented here are pretty naïve. Indeed, the resulting accuracy of 67% is completely unacceptable for a real spell corrector. (An earlier version of this blog post had an evaluation bug that made the accuracy seem much higher: over 90%, if I recall correctly.)
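For reference, the structure both of the critiques below assume is the standard noisy-channel framing: pick the candidate c that maximizes P(c) * P(word | c), where P(c) is the language model and P(word | c) is the error model. A minimal sketch, with both models as hypothetical placeholder functions rather than anything from the post:

def correct(word, candidates, language_model, error_model):
    # Score each candidate by its language-model probability times the
    # error-model probability of mistyping it as 'word'.
    return max(candidates, key=lambda c: language_model(c) * error_model(word, c))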
What's wrong with the language model? Counting occurrences in corpuses of text is not really good enough. There are very rare words that are nevertheless real words, and there are misspellings of common words (even in books) that are far more common than many actual words. You can easily find yourself in a situation where a correctly spelled rare word is corrected to a common misspelling of some other, much more common word.
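A toy illustration of that failure mode; the counts here are invented, but the shape of the problem is real:

from collections import Counter

# Invented corpus counts: "wether" (a castrated ram) is a real but rare
# word, while the misspelling "thier" occurs often enough in raw text to
# look like a legitimate dictionary word.
counts = Counter({'weather': 120000, 'whether': 95000,
                  'wether': 300, 'their': 150000, 'thier': 2000})

def best(candidates):
    # A naive language model: rank purely by corpus frequency.
    return max(candidates, key=lambda w: counts[w])

print(best(['weather', 'whether', 'wether']))  # -> 'weather'; the user's
                                               # correctly typed "wether"
                                               # gets "fixed" away
print('thier' in counts)                       # -> True; the misspelling
                                               # survives as a "word"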
What's wrong with the error model? Edit distance just isn't a very good error model. There are many common misspellings that aren't very close via additions or deletions. There are also one-letter edits that completely change the meaning of something. These should be treated as much bigger errors than modifications that have no effect on meaning.
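A sketch of where a better error model starts: edits carry different probabilities instead of a uniform cost of 1. The cost table below is invented for illustration; in practice these weights are learned from corpora of real spelling errors rather than written by hand:

# Hypothetical per-edit costs; lower means the mistake is more plausible.
EDIT_COST = {
    ('ei', 'ie'): 0.2,  # 'recieve' for 'receive' is an extremely common slip
    ('a', 'e'): 0.5,    # vowel confusions happen constantly
    ('t', 'd'): 0.7,    # phonetically close consonants
    ('s', 'x'): 2.0,    # implausible substitution, penalize heavily
}

def edit_cost(wrong, right):
    # Fall back to a uniform cost of 1.0 for edits we have no data on.
    return EDIT_COST.get((wrong, right), 1.0)

print(edit_cost('ei', 'ie'))  # 0.2: cheap, very plausible
print(edit_cost('s', 'x'))    # 2.0: expensive, unlikely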
For special contexts like Etsy, where words like "jewelry" (also correctly spelled "jewellery" in the UK), "Swarovski", and "ACEO" are extremely common, you really need a custom language model. I just wanted to put this out there lest people be under the misapprehension that spell correction is quite as easy as this blog post makes it out to be.
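To make the custom-model point concrete, here is a sketch assuming a hypothetical file of Etsy listing titles; the key is simply that the counts come from your domain's own text rather than a general book corpus:

import re
from collections import Counter

def words(text):
    return re.findall(r'[a-z]+', text.lower())

# 'listing_titles.txt' is a hypothetical dump of listing titles; counts
# built from it give terms like "swarovski" and "aceo" honest frequencies
# instead of treating them as near-impossible words.
with open('listing_titles.txt') as f:
    domain_counts = Counter(words(f.read()))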
If you are building a production-grade spellchecker (or, for that matter, anything ML), I would suggest spending some time getting familiar with the literature (Google Scholar is your friend). I have done various implementations over time (including, as it happens, a spellchecker, though that one was not meant to be production grade), and I find this to be the most rewarding approach in the long run.
Argh, another blog with no posting dates. Is it that hard to tell people when your post was written? The post starts off with "In the past week..." but we have no idea whether this was written 10 years ago or recently. It may be interesting content, but without that context it's hard to know if it's new or old.
Is there a bug in the train function? Using 'lambda: 1' for the defaultdict along with '+=' means that the first time a feature is encountered, its value is set to 2.
In [1]: from collections import defaultdict
In [2]: d = defaultdict(lambda: 1)
In [3]: d['foo'] += 1
In [4]: d['foo']
Out[4]: 2
I think the code is correct. To implement smoothing, you want to add 1 to the count of every word, regardless of whether it appears in the training data or not.
That is to say, a word that appears once should get a count of 2, and a word that doesn't appear at all should get a count of 1.
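Extending the session above makes that concrete (the word list is just an example):

In [5]: d = defaultdict(lambda: 1)
In [6]: for w in ['the', 'the', 'cat']: d[w] += 1
In [7]: d['the'], d['cat'], d['dog']
Out[7]: (3, 2, 1)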
This is a perfect example of where a (short) code comment would be helpful. The "lambda: 1" is a notable piece of code, but it's hard to tell that at a glance.
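For instance, something like this, reconstructed from the discussion here rather than copied from the post:

from collections import defaultdict

def train(features):
    # defaultdict(lambda: 1) gives every word a baseline count of 1, so an
    # unseen word is never assigned zero probability (add-one smoothing);
    # the += 1 then adds the observed occurrences on top of that baseline.
    model = defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model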
Nonsense; any software dev that can't follow a lambda that adds 1 should be taken outside and shot.
When Norvig talks of regular folks, he means people with 1/10th his IQ, which is still the top 1%. Norvig is so far out there on the IQ scale that I find it funny when some noob says he's found a bug!