Re: Suggestion for short text matching using dictionary

2008-06-27 Thread Grant Ingersoll
Yeah, I don't know if SecondString scales. Note that Lucene now has an implementation of Jaro-Winkler, which is a pretty good distance measure, so you may want to give that a try, plus if you see speedups, feel free to contrib a patch ;-) I'm wondering if Hadoop couldn't help w/ the scale
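For readers unfamiliar with the measure mentioned above, here is a minimal pure-Python sketch of Jaro-Winkler similarity. It is not Lucene's or SecondString's code. The function names, the 0.1 prefix scale and the 4-character prefix cap are the commonly cited defaults, assumed here for illustration. Some implementations only apply the prefix boost above a similarity threshold; this sketch applies it unconditionally.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and lie
    # within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters in order and count those that differ;
    # half of that count is the number of transpositions.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Because of the prefix boost, the measure favours strings that agree at the start, which tends to suit short names with typos near the end.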

Re: Suggestion for short text matching using dictionary

2008-06-27 Thread climbingrose
Thanks Grant. I did try SecondString before and found that it wasn't particularly good for doing a lot of text matching. I'm leaning toward a combination of Lucene and SecondString. Googling around a bit, I came across this project: http://datamining.anu.edu.au/projects/linkage.html. Looks inter

Re: Suggestion for short text matching using dictionary

2008-06-27 Thread Grant Ingersoll
below

On Jun 27, 2008, at 1:18 AM, climbingrose wrote:
> Firstly, my apologies for being off topic. I'm asking this question because I think there are some machine learning and text processing experts on this mailing list. Basically, my task is to normalize a fairly unstructured set of sh

Suggestion for short text matching using dictionary

2008-06-26 Thread climbingrose
Firstly, my apologies for being off topic. I'm asking this question because I think there are some machine learning and text processing experts on this mailing list. Basically, my task is to normalize a fairly unstructured set of short texts using a dictionary. We have a pre-defined list of produc
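
As a rough illustration of the task described above, the following sketch maps a noisy short text to its closest dictionary entry using Python's standard `difflib` module. The message is cut off here, so the shape of the dictionary is an assumption: a flat list of canonical names. The entries, the `normalize` helper and the 0.6 cutoff are all hypothetical. `difflib` scores with a Ratcliff/Obershelp-style ratio, not the measure used by any library named in this thread, and it compares against every entry, so it would not scale to a large dictionary without an index in front of it.

```python
import difflib
from typing import List, Optional


def normalize(text: str, dictionary: List[str],
              cutoff: float = 0.6) -> Optional[str]:
    """Return the dictionary entry closest to `text`, or None.

    Hypothetical helper: comparison is case-insensitive, and `cutoff`
    is the minimum difflib similarity ratio accepted as a match.
    """
    # Map lower-cased forms back to the canonical spelling.
    lookup = {entry.lower(): entry for entry in dictionary}
    hits = difflib.get_close_matches(
        text.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    names = ["Acme Widget 3000", "Acme Gadget Pro", "Globex Sprocket X"]
    print(normalize("acme widgit 3000", names))
    print(normalize("zzzz qqqq", names))
```

Returning `None` below the cutoff keeps unmatched texts visible for manual review instead of forcing them onto the nearest entry.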