November 22, 2008

Code

I put up some new code on the aptly titled section of my website:


http://vsedach.googlepages.com/code.html

Included are Jaro-Winkler and Levenshtein string similarity algorithms. Levenshtein is a general-purpose edit distance that counts insertions, deletions, and substitutions, while Jaro-Winkler is a more specialized measure suited to short strings such as names. One area where the latter comes in handy is normalizing manually entered records where, for example, salespersons' names may not be entered consistently. I found that Jaro-Winkler works best if you score the last name and the first name separately and add the results, giving the last name greater weight.
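To make the insertions/deletions/substitutions idea concrete, here is a minimal Python sketch of Levenshtein distance using the usual two-row dynamic programming table. It is only an illustration, not the code from the page linked above, and the function name levenshtein is a placeholder chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep b as the shorter string so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The result is a raw edit count, so longer strings naturally produce larger numbers; dividing by the length of the longer string is one common way to compare across lengths.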


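The name-matching trick described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the code from the linked page: it uses the textbook Jaro-Winkler definition (prefix scale 0.1, prefix capped at 4 characters), treats the result as a similarity between 0 and 1 rather than a distance, and the name_similarity helper with its 0.7 last-name weight is a hypothetical choice made for this example.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and close enough together.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as
    # half a transposition each.
    chars1 = [c for c, m in zip(s1, matched1) if m]
    chars2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(x != y for x, y in zip(chars1, chars2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def name_similarity(first1: str, last1: str, first2: str, last2: str,
                    last_weight: float = 0.7) -> float:
    """Score first and last names separately, weighting the last name more.
    The 0.7 default is an assumed value, not one taken from the post."""
    return (last_weight * jaro_winkler(last1, last2)
            + (1 - last_weight) * jaro_winkler(first1, first2))


if __name__ == "__main__":
    print(name_similarity("Jon", "Smith", "John", "Smith"))
    print(name_similarity("Jon", "Smith", "Jon", "Smyth"))
```

With the last name weighted more heavily, a misspelled first name ("Jon" vs. "John") costs less than a misspelled last name ("Smith" vs. "Smyth"). In practice you would probably lowercase and strip the names first, and tune the weight against real records.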

Also included are implementations of sparse vectors and radix trees (which I blogged about before).
