November 22, 2008


I put up some new code on the aptly titled section of my website:

Included are Jaro-Winkler and Levenshtein string similarity distance algorithms. Levenshtein is a general algorithm based on insertions/deletions/substitutions, while Jaro-Winkler is a more tweaked implementation specifically suited to short strings such as names. One area where the latter comes in handy is denormalizing manually entered records where for example salespersons' names may not be consistently entered. I found that Jaro-Winkler works best if you add the distance of the last name and the first name separately while giving the last name greater weight.

Also included are implementations of sparse vectors, and radix trees (which I blogged about before).

No comments: