There is a trade-off even in the paper that was linked. You have to disallow distances (d) that are too large when the words are small: if you query for everything within distance 3 of a 3-letter word and your dictionary has many 3-letter words, every single one of them will end up in your result set.
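For what it's worth, the usual workaround is a length-dependent cutoff. A minimal sketch (the function name and the one-error-per-five-characters rule are my own, not from the paper):

```python
def max_dist(word: str) -> int:
    # Hypothetical length-scaled cutoff: no fuzziness for very
    # short words (d=3 on a 3-letter word matches everything),
    # then roughly one allowed error per five characters,
    # capped at 3.
    n = len(word)
    if n <= 3:
        return 0
    return min(3, n // 5 + 1)
```

So a 4-letter word gets d=1 while a 12-letter street name gets the full d=3, which keeps short words from drowning the result set.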
This is a recurring issue with fuzzy matching in general, but automata are especially slow and take forever to build.
I originally read the paper and fought with the implementation for OCR use cases, and that was stupid / short-sighted. My problem wasn't a big result set, though; it was the size of the automaton at N >= 3.
What I usually need: fuzzy results with a quality rating. Given a database of all customer names, give me the best matches for the OCR result. Same for 'all streets in Germany'.
Since these input words can be quite long, N = 3 (or higher) might still provide decent quality (3 errors in 15 characters might still be interesting), but the automaton explodes in size.
On top of that, ignoring the high number of edit operations I am interested in: I usually want the real distance, not a short-cut/cut-off. So not 'give me all results that have fewer than 3 errors', but 'give me the 100 best results, ordered by quality (errors/length)'.
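That kind of ranked query doesn't need an automaton at all; a brute-force pass with a normalized score already expresses it. A sketch under my own naming (the errors/length score here divides by the longer string so it stays in [0, 1]):

```python
import heapq

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, no cutoff:
    # keep one row of the DP matrix at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_matches(query: str, candidates, k=100):
    # 'Give me the k best results, ordered by quality':
    # rank by errors / length instead of a hard distance cutoff.
    scored = ((levenshtein(query, c) / max(len(query), len(c)), c)
              for c in candidates)
    return heapq.nsmallest(k, scored)
```

For a few hundred thousand customer names this is often fast enough in practice, and 3 errors in a 15-character word scores the same as 1 error in a 5-character word, which is exactly the "quality" ordering described above.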
Or - and obviously that's always a valid option - I just failed to make that work in any decent fashion.
I did something similar to what you are looking for with my gocleo implementation. I compute what I call a Jaccard (sp?) rating, which is the percentage something is off by. github.com/jamra/gocleo
You can combine that with something like my LevenshteinTrie github.com/jamra/LevenshteinTrie (or if you can create a DAWG, it will be more compressed).
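The core idea behind a Levenshtein trie can be sketched in a few lines: walk the trie carrying one DP row per node, so words sharing a prefix share the work, and prune any branch whose row minimum already exceeds the allowed distance. This is a rough illustration of the technique, not the actual code from the repo:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on the final node of a stored word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def search(root, query, max_dist):
    # Returns (distance, word) pairs within max_dist of query.
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, results, max_dist)
    return results

def _walk(node, ch, query, prev_row, results, max_dist):
    # Extend the DP by one character (one trie edge).
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        row.append(min(row[j - 1] + 1,                          # insertion
                       prev_row[j] + 1,                         # deletion
                       prev_row[j - 1] + (query[j - 1] != ch))) # substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((row[-1], node.word))
    if min(row) <= max_dist:  # prune: this branch can never recover
        for c, child in node.children.items():
            _walk(child, c, query, row, results, max_dist)
```

A DAWG gives the same traversal over a smaller structure, since shared suffixes are merged as well as shared prefixes.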