
Soundex was designed for a very specific purpose. It is very culture-dependent and, in my experience, works very poorly in most practical name-matching applications.


Soundex works fine as part of a larger process, especially when combined with other kinds of normalization. You need a human to make the final judgement on matches. Over the course of a year I have to match 100k names against a database of 850k people. Soundex is great for flagging names that might match, or for flagging matches that might be incorrect. I use Soundex in combination with NYSIIS, Double Metaphone, lists of commonly confused names, etc. Before I created our current matching process, we were creating approximately 5-10k duplicate records a year.
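To make the "flag, then let a human decide" idea concrete, here is a minimal sketch of indexing names under several keys and reporting which keys agree for each candidate. Everything in it is an assumption for illustration: `build_index`, `flag_candidates`, and the two toy key functions are hypothetical names, and the toy keys are crude stand-ins for real phonetic encoders such as Soundex, NYSIIS, or Double Metaphone.

```python
from collections import defaultdict


def build_index(names, key_funcs):
    """Index every name under each (key label, key value) pair."""
    index = defaultdict(set)
    for name in names:
        for label, func in key_funcs.items():
            index[(label, func(name))].add(name)
    return index


def flag_candidates(name, index, key_funcs):
    """Return {candidate: set of key labels that agree} for human review."""
    hits = defaultdict(set)
    for label, func in key_funcs.items():
        for candidate in index.get((label, func(name)), ()):
            if candidate != name:
                hits[candidate].add(label)
    return dict(hits)


# Toy key functions (assumptions, not real phonetic algorithms).
def letters_only(name):
    """Lowercase the name and drop punctuation and spaces."""
    return "".join(c for c in name.casefold() if c.isalpha())


def sorted_letters(name):
    """Order-insensitive key, so transposed letters still collide."""
    return "".join(sorted(letters_only(name)))


KEY_FUNCS = {"letters": letters_only, "sorted": sorted_letters}
index = build_index(["O'Brien", "Obrien", "Brian", "Brain"], KEY_FUNCS)

# Nothing is merged automatically; the output is a review list.
print(flag_candidates("OBrien", index, KEY_FUNCS))
print(flag_candidates("Brian", index, KEY_FUNCS))
```

A candidate flagged by several keys is a stronger hint than one flagged by a single key, but in this sketch the final call still goes to a person.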

Quick edit: Our data sources are handwritten and typed names, often transcribed by a second party. So algorithms that detect transposition errors as well as phonetic errors are really helpful.
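For the transposition side, one common choice is the optimal string alignment variant of Damerau-Levenshtein distance, which counts a swap of two adjacent characters as a single edit. The sketch below is an assumption about what such a check might look like, not the process described above; `osa_distance` is a hypothetical name.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b.

    Allowed edits: insert, delete, substitute, and swap two adjacent
    characters. Each costs 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "TH" <-> "HT".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("MARTHA", "MARHTA"))  # one adjacent swap
print(osa_distance("SMITH", "SMYTHE"))   # one substitution, one insertion
```

Plain Levenshtein distance would score the swapped pair as two edits, which is why a transposition-aware measure suits typed or transcribed input.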


I've used a Python implementation of soundex() in a production data mining app to help resolve things like ECQUADOR->ECUADOR. It worked well, as one entity resolution mechanism among many.
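For reference, a minimal sketch of American Soundex in Python, showing why ECQUADOR and ECUADOR collide. This is not the implementation used in that app (which the comment does not show); the `soundex` function and `CODES` table here are written from the standard rules and only handle ASCII letters.

```python
# Consonant groups of American Soundex. Vowels, Y, H and W have no code.
CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code, or "" if no ASCII letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return ("".join(result) + "000")[:4]


print(soundex("ECQUADOR"), soundex("ECUADOR"))  # E236 E236
print(soundex("Robert"), soundex("Rupert"))     # R163 R163
```

C and Q share the code 2, so the stray Q in ECQUADOR is absorbed into the preceding C and both spellings end up with the same key.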



