While these specific challenges are trivial now because of increased memory and CPU, user expectations have increased too, requiring more advanced features that are not trivial to implement across different platforms. Users now expect good auto-suggestion when doing text entry on mobile platforms and laugh when it fails (http://damnyouautocorrect.com/). I just launched an iPad app to improve communication for speech-disabled users, and nearly everyone I talked to wanted word completion and next-word suggestion, even when offline.
While it seems trivial for Google or a database-backed server to provide real-time intelligent suggestions (e.g. suggest 5 words that follow "Harry"), implementing such a feature on the iPad took me over a month, even though I knew exactly what I wanted to build and had all the necessary data beforehand. I had a list of 1 million words and phrases with frequency of usage (16 MB of text data) and wanted to suggest 3-7 words in real time as the user typed each letter. Doing that on the iPad required quite a bit of engineering and optimization.
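To make the gap concrete, here is the trivially easy, memory-rich version as a minimal Python sketch (data and cutoffs invented by me, not from the app): a sorted phrase list plus binary search gives prefix lookup, and frequency ranks the results. None of the data packing, caching, or latency work an iPad of that era actually demanded appears here.

    import bisect

    # Invented toy data: (phrase, usage frequency), kept sorted by phrase.
    entries = sorted([("happen", 2800), ("harbor", 3000), ("harry", 9000),
                      ("harry potter", 7500), ("hat", 2500)])
    keys = [phrase for phrase, _ in entries]

    def suggest(prefix, k=5):
        """Return up to k highest-frequency phrases starting with prefix."""
        lo = bisect.bisect_left(keys, prefix)
        hi = bisect.bisect_right(keys, prefix + "\uffff")  # past any completion
        return [p for p, _ in sorted(entries[lo:hi], key=lambda e: -e[1])[:k]]

    print(suggest("ha", 3))  # ['harry', 'harry potter', 'harbor']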
I know it's HN folklore to claim "Trivial! Would do over a weekend!", but please...
Doing a half-decent production spell checker is STILL a major feat. Same as "just crawling the web" (further down the discussion). Both require problem understanding and engineering you can't see or appreciate at a glance.
And no, looking up individual words in some predefined dictionary doesn't qualify as half-decent spell checking, especially for non-English languages. Spelling correction is another step.
That's not the point of the article. He's not talking about writing a state-of-the-art spelling-and-grammar checker.
> And no, looking up individual words in some predefined dictionary doesn't qualify as half-decent spell checking,
Well, but the author is talking about that problem! Even if you don't consider that real spell-checking, his point still stands. Let's define crappy spell-checking as "looking up individual words in some predefined dictionary"; that problem used to be hard and is now very easy, as in, you could write one in 15 minutes using Python.
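Something like this minimal sketch is all that definition demands (assuming a system word list such as /usr/share/dict/words):

    # Crappy spell-checking: flag any word missing from a fixed dictionary.
    with open("/usr/share/dict/words") as f:
        dictionary = {line.strip().lower() for line in f}

    def unknown_words(text):
        return [w for w in text.lower().split()
                if w.strip(".,!?;:'\"") not in dictionary]

    print(unknown_words("Their coming too sea if its reel."))  # [] -- every "word" exists

Note that it happily passes the thread's example sentence, which is exactly the limitation being debated below.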
I read the point of the article as comparing "spell-checking then (80s) and now", whereas others read it more along the lines of "looking up static English words then and now". Your nickname sounds Japanese, but I assume you're talking about English as well, with those 15 minutes.
Maybe it's not fair to cite the work of superheroes, but Peter Norvig wrote a spelling corrector in 21 lines of Python: http://norvig.com/spell-correct.html
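The heart of his approach, condensed here from that article (big.txt is the training corpus he provides; his full version also tries edit distance 2), is to generate every string one edit away and keep the most frequent known one:

    import re
    from collections import Counter

    # Word frequencies from Norvig's training corpus.
    WORDS = Counter(re.findall(r"[a-z]+", open("big.txt").read().lower()))

    def edits1(word):
        """All strings one delete, transpose, replace, or insert away."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correction(word):
        known = lambda ws: {w for w in ws if w in WORDS}
        candidates = known({word}) or known(edits1(word)) or {word}
        return max(candidates, key=WORDS.get)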
It is fair, it is a first step in that direction :-)
That corrector has no context, though, so it will not correct misspelled words that happen to come out as other words (e.g. "Their coming too sea if its reel.").
This is especially important on the web, where pretty much every conceivable word is "correct" (a name of a company or otherwise). False positives are costly, and context disambiguation critical. Let the fun begin.
Incidentally, the linked article by Peter Norvig also mentions that adding context is the best way to improve spelling correction. If you think this claim is so controversial, please explain yourself instead of just downvoting.
People want their documents (or queries) free of spelling errors. That is their pain and that is the challenge.
The example sentence has misspelled words, hence is within the domain of spell checking. This type of misspelling, where the error produces a homophone of the intended word, is a very common spelling problem. The academic terminology is uninteresting to most users, however.
Or if you mean to posit that "spell checking really means looking up if a word exists in a static dictionary of English", then yes, that's easy and solved, no argument there.
> The example sentence has misspelled words, hence is within the domain of spell checking.
No, it doesn't. It is grammatically incorrect, but all the words are spelled correctly. You're definitely talking about a grammar checker, not a spelling checker.
If you were a teacher marking a student's paper, you would label those as spelling errors, not grammar errors. The reason is that a different spelling of the words would create the intended sentence. No grammatical variation will do that (reordering, conjugating differently, etc.).
In your view, "spelling check" is applied to individual words, to see if they appear in a fixed dictionary (see my other comment about difficulties with choosing this "correct" dictionary in reality, though). I can imagine this view is inviting for programmers, because it's easy to implement, but I doubt anyone else finds useful a definition that says there are no misspellings in "Their coming too sea if its reel."
In my view, "spell check" applies to utterances and roughly means "all words are spelled as per the norm of the language; I can send this document to my boss/customer and they won't laugh at my spelling." It is a more user-centric view, and more complex too, because it covers intent and norms, as opposed to the comforting lookup table for a few hand-picked strings. Modern spell checkers make heavy use of statistical analysis of large text corpora, to reasonably approximate context needed to model such intent.
Once we agree on the terminology, I believe we are in agreement, so let's not split hairs. The "correctly spelled" sentence under question comes from the Wikipedia article on spell checking, by the way.
> I can imagine this view is inviting for programmers, because it's easy to implement, but I doubt anyone else finds useful a definition that says there are no misspellings in "Their coming too sea if its reel."
I realize this is just a debate over a definition, so it's not very meaningful, but the fact is, you're on the wrong side of the common definition here. I just tested, and every spell checker I tried (Chrome, Firefox, MS Word, TextMate, TextEdit - perhaps some of these rely on the same underlying engine, I'm not sure) accepts that sentence as not having a spelling error, so clearly there's some use for such a definition.
The grammar checkers, on the other hand, don't like it, but by changing "their" to "they're", they all accept it, despite the fact that it's still garbage. So don't overestimate how good "modern" spell checkers are... though there may be techniques to do a better job, they're not in common use, at least in the most common spell-checking contexts (which, let's be honest, pretty much means MS Word).
There is no question whatever: the sentence contains misspelled words. "They're" is misspelled as "Their", "to" as "too", "see" as "sea", etc.
There are also no non-words. It happens that today's spelling checkers are generally just non-word detectors. This doesn't mean that anyone defines "misspelled" to mean "misspelled in such a way that the result is not a word at all".
And yes, a non-word detector is still very useful, and it's much much easier to make than something that also determines reliably when words are misspelled in ways that produce other words, so there's lots of software out there (perhaps essentially all of it) that contains only a non-word detector.
I think it's perfectly reasonable to define "spelling checker" to include mere non-word detectors. Or for that matter non-common-word detectors. (I expect most spelling checkers will reject "hight", and very sensibly because if someone types that they probably meant "high" or "height" or something -- but it's a perfectly good word, albeit a rare and archaic one.) But there's no way it's correct to say that "sea" isn't misspelled in that sentence merely because the mistake happens to have produced something that's an English word.
The sentence contains misspelled words. The errors in those words happen to make them collide with other existing words. That doesn't change the category of the error.
No, it contains correctly spelled words used in place of other desired words.
"I sea your eyes", "I she your eyes"
If you want to correct these kinds of errors, you must know a lot about natural language. Also, the above are trivial cases. There are tons of edge cases and far more difficult distinctions. Here's an amusing one that can lead to "Microsoft paperclip"-like interactions:
"I gave him the new pink dress as a present" =>
"I gave her the new pink dress as a present"
No, idiotic spell-checker, I do mean him. My friend is a cross-dresser, shut up and let me type.
Such a spellchecker would also be useless for poetry. And if you think poetry is too obscure to matter, then such a spellchecker would also be useless for irony. Suddenly, you lose all the hipsters from your potential users (except if they start using it ironically).
Anyway, no spell checker in widespread use attempts this -- it's probably a very hard nut to crack, and perhaps uncrackable in the general case.
> No, it contains correctly spelled words used in place of other desired words.
If you intend to write a word meaning "actually existing as a thing" and spell it "reel", it is monstrously unlikely that you genuinely thought a word meaning "a cylinder on which flexible materials can be wound" was a suitable substitute. If you did, that would be the use of a correctly spelled (but incorrectly selected) word in place of a desired word: a grammatical error, in other words. However, if you just pick the wrong letters to construct a phoneme, as is almost certainly the case here... yep, that's a spelling error.
To put it another way, imagine that the word "reel" didn't actually exist, and I make precisely the same error, substituting an 'e' for an 'a'. All of a sudden, by your argument, what was a grammatical error is now a spelling error. But the mistake I made hasn't changed, so that makes no sense.
> If you want to correct these kinds of errors, you must know a lot about natural language.
>...
> Anyway, no spell checker in widespread use attempts this
So? The category of error doesn't change with how difficult it is to fix.
Hey, here's a chance to invent some new terminology!
I would say that Norvig's corrector is a first-order spelling corrector, since it works within the context of a single word.
A second-order corrector would take into account the word before or after it to choose the spelling that is more likely to make sense. ("Their coming" would suggest a correction to "They're coming")
Third-, fourth-, (and so on) order expands the distance of words considered.
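A toy sketch of the second-order case, with invented bigram counts and confusion sets standing in for a real corpus:

    # Among a word's known confusion set, prefer whichever variant forms
    # the more frequent bigram with the following word. All data invented.
    BIGRAM_COUNTS = {("they're", "coming"): 1200, ("their", "coming"): 15,
                     ("there", "coming"): 40, ("their", "house"): 2300}
    CONFUSIONS = {"their": ("they're", "there"), "too": ("to", "two")}

    def second_order_suggest(word, next_word):
        """Return the spelling of `word` most plausible before `next_word`."""
        candidates = (word,) + CONFUSIONS.get(word, ())
        return max(candidates, key=lambda c: BIGRAM_COUNTS.get((c, next_word), 0))

    print(second_order_suggest("their", "coming"))  # they're
    print(second_order_suggest("their", "house"))   # their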
It's valid, but less common, and that's why I specifically chose the wording "suggest a correction" instead of "correct". Spell checkers are still no substitute for actual thought.
No, it's spell-checking. Use of the word "reel" in this sentence, for instance, is definitely a spelling error. There's no grammatically valid form of the word "real" spelled with two 'e's. The fact that "reel" happens to be a valid word doesn't mean that its presence in the sentence is due to a grammatical error - it's just an accidental collision.
That no spell checker might be able to catch this specific class of error doesn't change the type of error it is.
Spell checking for languages that sport a high rate of agglutination, where you can't (feasibly) enumerate every possible form of a stem, is still a major feat. Languages like Turkish, Finnish, and Hungarian are prime examples of this.
What makes working in a finite, very limited set of resources so rewarding is that those limitations turn mere programming into art.
You can't have art unless you constrain yourself somehow. Some people paint with dots only and some people express themselves in line art. If they allowed themselves any imaginable method that is applicable, they wouldn't be doing art. They could just take a photograph of a setting, and that photograph wouldn't say a thing to anyone.
Endless bit-twiddling and struct packing may turn trivial methods into huge optimization-ridden hacks and not get you too far vertically, but given only a few resources, those hacks are required to turn the theoretical approach into a real application that you can actually do useful work with. Those hacks often exhibit ingenious thinking; many examples approach art, and the best definitely are art. And any field where ingenious thinking is required will push the whole field forward.
Similarly, for example, using a Python set as a rudimentary spell-checker is fast, easy, and convenient, but it's no hack because it requires no hacking. It's like taking that photograph, or using a Ferrari to reverse from your garage out to the street and then driving it back. What ingenious tricks are required to accomplish that? None.
The bleeding edge has simply moved - it must lie somewhere these days, of course. Maybe computer graphics, although it has always demonstrated the capability to lead the bleeding edge, so there's actually no news there. The fact is that the bleeding edge is more scattered. Early on, every computer user could feel and sense the bleeding edge because it revolved around tasks so basic that you could actually see the limits with your own eyes. Similarly, even a newbie programmer would face the same limitations early on and learn how others had dealt with them. Now you can stash gigabytes of data into your heap without realizing that you did, and wonder why the computer felt sluggish for a few seconds. Or how would you compare two state-of-the-art photorealistic, interactive, real-time 3D graphics demos? There's so much going on behind the curtains that it's difficult to evaluate the works without extensive domain knowledge of most of the technologies used.
Finding the bleeding edge has become a field in itself.
I understand your general thesis, but your statements about art just seem completely off.
> You can't have art unless you constrain yourself somehow.
> They could just take a photograph of a setting, and that photograph wouldn't say a thing to anyone.
While I realize that art is subjective, I'm very surprised that you would put these conditions around what you consider to be art. Especially since it seems to be a condemnation of a large subset of photography.
Just curious, but would you like to give me a couple of examples of recognized good art that isn't constrained by some rule, method, technique, or approach?
A large subset of photography isn't art. In fact, most everything people create isn't art per se—if it were, there wouldn't be good art nor bad art, just art and lots and lots of it. Spend a few hours on some photo-sharing site and see what people shoot. They're photographs, but rarely art.
But there are grades of art. Look at this search: http://goo.gl/2mLVI — a thousand sunset pictures, while maybe pretty, aren't generally art and not because it's the same sun in each picture. Most of these pictures have nothing to say. Now, some object lit by the sunset or silhouetted against it gives a lot more potential to be art. A carefully crafted study of a sunset in the form of a photograph can be art, but it requires finding certain constraints first, finding a certain angle that makes the photograph a message, and eventually conveying through the lens something that makes the viewer stop for a moment, to give an idea, to give a feeling, to give a confusion.
I'm genuinely mystified by your response. Art is an entirely subjective endeavor. You seem to think that you can decide what is and is not art. Frankly, you're not qualified for that task (read: no one is).
> A large subset of photography isn't art
> a thousand sunset pictures, while maybe pretty, aren't generally art
Seriously? Because you get to decide? Your response just seems incredibly egocentric. You can define what art means to you all day long, but you can not define what art means to everyone.
I don't think so. Photography as art has never been just pointing and shooting. You have to find the right natural lighting, or the right facial expression, or the right color composition, etc. All those restrictions make photography an art that not just anyone can achieve (at least, not without extensive training and effort).
I strongly disagree that extensive training is required to create art. Art comes in many forms, and it's quite myopic to imply that only professionals can create art.
Interesting. But maybe a better title would be: A Spellchecker Used To Require A Major Feat of Software Engineering.
Some may say something is lost and resources wasted - taken for granted as we now brute force our way through such problems. Surely going backwards?
But now we are, a million times a second; we're free to disambiguate a word's meaning, check its part of speech, figure out whether it's a named entity or an address, figure out if it is a key topic in the document we are looking at, and write basic sentences. I agree. That is progress.
> Some may say something is lost and resources wasted - taken for granted as we now brute force our way through such problems. Surely going backwards?
I think it is progress: Yes, we can brute force today through the problems that were a feat a decade or two ago. But: Not having to solve those problems frees a lot of time. Time that can be spent on problems that are a major feat today.
It's a shame that solutions like Bob Morris's no-dictionary spell checker [1] are left languishing just because we all have fast computers.
Getting better at a problem in one domain can spin off benefits in others.
Spelling is part of language, and language is something that computers are really bad at. Brute force helps a bit with that (autocorrect, Siri), but better understanding would be cool.
[1] Morris, Robert & Cherry, Lorinda L., 'Computer detection of typographical errors', IEEE Transactions on Professional Communication, vol. PC-18, no. 1, pp. 54-64, March 1975.
It seems the author of that article doesn't know that spell checking, translating, and understanding text are still major feats, even in brand-new software. I don't know how good the spell checkers are for English, but in German they have sucked for years. You can't just let MS Word autocorrect your text, because the spell checker will insert more errors than it removes. So when I want to find out how to spell a word correctly in German, I just use the Google search bar. Why? Because they consider spell checking a hard task TODAY!
Spell checking is not about comparing a list of words to what the user wrote and telling him what didn't match. It's much more about understanding the user's intention and helping him shape that intention into officially recognised grammar and spelling. Example: "then" is a correct word. But in the context of "Google's spell check is better then Word's", "then" is actually wrong. (Google also doesn't tell you about that mistake, but the first search result actually contains a "than", which is recognised as what you actually meant.)
I hope I've made it clear why I think having the best spell checker is still a major feat, and why I think the title really should be revised.
Yes. This whole article struck me as the spell-checking equivalent of the "I can build MapReduce in 5 lines of language N" meme that went around a while ago.
"""It seems the author of that article doesn't know that spell checking, translating and understanding text are actually major features of pretty new software, too. """
Actually it seems that everybody reading the article got the same WRONG impression.
The author knows full well what it takes to build a full-featured i18n spell checker.
That is BESIDE the point.
What he says is that doing a basic (lame ass) spell-checker in the 80s used to be a MAJOR undertaking, and now, doing EXACTLY THE SAME is trivial.
His point is not about spell-checking.
It's about modern OS, language, CPU, HD and memory conveniences, vs what one had to deal with in the olden days.
This relates to Fred Brooks's 'No Silver Bullet': the 'accidental complexity' of coping with hardware constraints has given way to the 'essential complexity' of writing an algorithm to check someone's spelling.
I know Bloom Filters came in handy for this sort of work but I'd love to see some other data structures and algorithms that were developed in the 80s to deal with limited memory.
You might like 'How to Fit a Large Program Into a Small Machine' (http://www.csd.uwo.ca/Infocom/Articles/small.html), where some of the founders of Infocom talk about the approaches they used to fit Zork (which required a massive 1MB of memory) onto microcomputer systems which had 32K and a floppy disk drive.
See also the Digital Antiquarian's fascinating (and reasonably technical) blog on the origins of Infocom: http://www.filfre.net/tag/infocom/
See the 1988 paper "The world's fastest Scrabble program" by Andrew W. Appel (Princeton) and Guy J. Jacobson (Carnegie-Mellon).
They used a data structure they dubbed a DAWG (directed acyclic word graph): a trie that shares common prefixes between words and also collapses common endings (every word ending in "tion" ends in the same portion of the graph). As I recall, they used 5 bits to store ASCII uppercase characters and either 11 or 19 bits to address nodes in the graph, leading to a data structure that compressed words as well as any LZW-type compression, with the advantage that it was directly readable with respect to word extraction.
While the paper was published in 1988, the program was written and being used much earlier in '82 or '83.
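The generic version of the suffix-sharing trick is trie minimization: build a trie, then merge identical subtrees bottom-up, so every "...tion" tail is stored once. A minimal Python sketch of the idea only (nothing here reproduces their packed 5-bit node encoding):

    class Node:
        def __init__(self):
            self.children = {}
            self.final = False

    def build_trie(words):
        root = Node()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, Node())
            node.final = True
        return root

    def minimize(node, registry):
        # Post-order: once children are shared, identical subtrees get
        # identical keys and collapse to a single node.
        for ch, child in list(node.children.items()):
            node.children[ch] = minimize(child, registry)
        key = (node.final, tuple(sorted((c, id(n)) for c, n in node.children.items())))
        return registry.setdefault(key, node)

    def contains(root, word):
        node = root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.final

    dawg = minimize(build_trie(["nation", "station", "creation"]), {})
    print(contains(dawg, "station"))  # True
    print(contains(dawg, "stat"))     # False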
Yes, and "Programming Pearls" has a nice chapter (13.8) on how doug mcilroy fit the spell dict into less than 64k (
http://code.google.com/p/unix-spell/).
In the 80's I got the gist of spellchecker algorithms.
An Apple II spellchecker that I used a few times seemed to spell-check by brute force, with the dictionary filling a separate 143K floppy disk. It was a somewhat slow, batch-like, one-document-at-a-time operation, but it was good enough to be useful.
As a programmer in the 80's, I saw a spellchecker that used a small hash table to tell whether a word was misspelled or not. The hash table could be made even smaller if we limited the dictionary to about 16,000 of the more common words. If I remember correctly, the smallest hash table was generated as an array of 16-bit integers in the source code before compiling, and its small footprint was kept in memory by the built program. Once a word was identified as a misspelling, using a very fast hash look-up, suggestions could be pulled from the actual dictionary on disk. Using the hash, there was the possibility of false positives: misspelled words occasionally accepted as correct. Maybe this was a Bloom filter.
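That description does match a Bloom filter: hash each dictionary word to several bit positions, set those bits, and later accept any word whose bits are all set, tolerating rare false positives. A minimal sketch with invented parameters:

    import hashlib

    class Bloom:
        def __init__(self, size_bits=1 << 18, num_hashes=7):
            self.size, self.k = size_bits, num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, word):
            for i in range(self.k):  # k salted hashes -> k bit positions
                digest = hashlib.md5(f"{i}:{word}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, word):
            for p in self._positions(word):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, word):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(word))

    words = Bloom()
    words.add("spell")
    print("spell" in words, "spel" in words)  # True False (almost certainly)

At a few bits per word, such a filter is far smaller than storing the words themselves, which fits the small-footprint description.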
Progress for computing, absolutely. Progress for software engineering? Questionable. Are the clever solutions applied in legacy spellcheckers obsolete simply because sufficient brute force is now available on everyday machines to solve everyday problems?
The problem with a lot of clever solutions is that they often obfuscate the code. This makes it harder for someone to figure out exactly what you are doing and how. The clever solutions were necessary back then because cycles, memory, and storage were at a premium. Today they are mostly premature or unnecessary optimizations.
As an example, a few months ago a young developer I worked with implemented a set of attributes on a table as a bit field. He had just read an article about them and did it because it sounded cool. It made running reports against the data and doing imports via plain SQL impossible without writing a bunch of extra functions. His clever solution saved about $0.0001 worth of disk and ended up costing several hours of developer time, because he didn't just use a join table and a foreign key.
Yes. Engineering is constraint management. You're constantly trading off one thing for another: speed, flexibility, maintenance, cost, durability, etc. When one thing fails to be a relevant constraint, it's ok to focus on the other things that are of more immediate concern.
There are always questions on the edge of computational complexity. In 30 years: "Indexing all of the individual web pages in the world used to be a tough challenge. Network speeds were in the hundreds of megabits, and..."
I would be surprised if that analogy holds up. I have heard the Google guys state that the growth of the web easily outpaces advances in computing and bandwidth. The English language, not so much...
You're missing the point. "Growth of the language" wasn't the problem for spell-check, it was the size of the language combined with limited space. Computational breakthroughs solved that problem. The parent to my post implied that solving the 'space' problem in a different way (i.e. by having more space) somehow harmed software engineering, which it hasn't. Ergo, the suggestion that the web is outpacing computing and bandwidth all the more supports my point, since that problem will be persistent.
> In 30 years: "Indexing all of the individual web pages in the world used to be a tough challenge."
Eh, that needs qualification to make sense. Just crawling and trivially tokenizing the web counts as "indexing all the individual web pages" but that's never been such a monstrous task.
Having results fresh to the day/hour/minute is a better example of something that will probably be looked at as child's play, even though it's a very new, very computationally expensive development in search.
Designing a high-quality and compact spell checker for agglutinative languages like Turkish or Hungarian is definitely a non-trivial and challenging problem.
It's not that easy today either. Loading a list of words into a hashtable will work for... Chinese (pinyin), but most languages have declensions and conjugations. You want to validate both "cat" and "cats", and "walk", "walking", and "walked", and the word list normally wouldn't contain those. (I haven't checked the English ones, but it certainly doesn't for languages with more complex inflection.)
Yes, you don't have the memory complications that are really hard, but you still need to think. Get a proper data structure (a trie, for example), and fill it with all forms of the words.
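A toy sketch of that "fill it with all forms" step, with invented suffix rules (real morphologies, e.g. Hunspell's affix files, encode hundreds of rules per language):

    # Invented toy rules; real affix tables are far richer.
    SUFFIXES = {"NOUN": ("", "s"), "VERB": ("", "s", "ing", "ed")}

    def forms(stem, pos):
        out = set()
        for suf in SUFFIXES[pos]:
            # Drop a final silent 'e' before a vowel-initial suffix: love -> loving.
            drop_e = stem.endswith("e") and suf[:1] in ("a", "e", "i", "o", "u")
            out.add((stem[:-1] if drop_e else stem) + suf)
        return out

    lexicon = set()
    for stem, pos in [("cat", "NOUN"), ("walk", "VERB"), ("love", "VERB")]:
        lexicon |= forms(stem, pos)   # cat, cats, walk, walking, loved, ...

The generated forms then go into whatever lookup structure you chose, trie or plain set.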
I had a decent spell checker on my Apple //c - 128K RAM (64K commonly usable) - and it had to fit on a 140KB floppy. While slow, it worked better than roughly the first 10 years of MS Word.