November 22, 2008

Code

I put up some new code on the aptly titled section of my website:


http://vsedach.googlepages.com/code.html

Included are implementations of the Jaro-Winkler and Levenshtein string similarity measures. Levenshtein is a general-purpose edit distance that counts insertions, deletions, and substitutions, while Jaro-Winkler is a more specialized measure suited to short strings such as names. One area where the latter comes in handy is normalizing manually entered records where, for example, salespersons' names may not be entered consistently. I found that Jaro-Winkler works best if you score the last name and the first name separately and add the results, giving the last name greater weight.
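
To make the idea concrete, here is a minimal Python sketch of both measures and of the weighted name comparison. This is not the code from the page linked above. The function names, the 0.6/0.4 weights, and the Winkler prefix scale of 0.1 (with a prefix cap of 4) are assumptions chosen for illustration, and lowercasing the names before comparing them is my own simplification.

```python
def levenshtein(s, t):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution
        prev = curr
    return prev[-1]


def jaro(s, t):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def name_similarity(name_a, name_b, last_weight=0.6, first_weight=0.4):
    """Compare (first, last) name pairs; the weights are hypothetical."""
    first_a, last_a = (part.lower() for part in name_a)
    first_b, last_b = (part.lower() for part in name_b)
    return (first_weight * jaro_winkler(first_a, first_b)
            + last_weight * jaro_winkler(last_a, last_b))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(jaro_winkler("MARTHA", "MARHTA"))
    print(name_similarity(("Jon", "Smith"), ("John", "Smith")))
```

With the last name weighted more heavily, a typo in the first name costs less than a comparable typo in the last name, which is the behavior described above. A real deduplication pass would also need a match threshold, which has to be tuned on the actual data.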



Also included are implementations of sparse vectors and radix trees (which I blogged about before).
