Yeah, I don't know if SecondString scales. Note that Lucene now has
an implementation of Jaro-Winkler, which is a pretty good distance
measure, so you may want to give that a try. If you see speedups,
feel free to contribute a patch ;-)
I'm wondering if Hadoop couldn't help with the scale.
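For anyone who hasn't used Jaro-Winkler, here is a rough, self-contained Python sketch of the measure as it is commonly described. It is not Lucene's or SecondString's code. The function names, the 0.7 boost threshold, the 0.1 prefix scale, and the sample dictionary entries are all assumptions made up for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two characters match if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4, boost_threshold: float = 0.7) -> float:
    """Jaro similarity with a bonus for a shared prefix (assumed defaults)."""
    base = jaro(s1, s2)
    if base < boost_threshold:
        return base
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def best_match(text: str, entries: list[str]) -> tuple[str, float]:
    """Return the dictionary entry most similar to `text`, ignoring case."""
    scored = [(jaro_winkler(text.lower(), e.lower()), e) for e in entries]
    score, entry = max(scored)
    return entry, score


if __name__ == "__main__":
    # Hypothetical dictionary entries, only for demonstration.
    dictionary = ["iPod touch", "iPod nano", "MacBook Pro"]
    print(best_match("ipod tuch", dictionary))
```

Note that `best_match` compares the input against every entry, so it is linear in the dictionary size per lookup. That is the part where an index (Lucene) or a distributed job (Hadoop) would have to come in for large inputs.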
Thanks, Grant. I did try SecondString before and found that it wasn't
particularly good for doing a lot of text matching. I'm leaning toward a
combination of Lucene and SecondString. Googling around a bit, I came across
this project http://datamining.anu.edu.au/projects/linkage.html. Looks
inter
On Jun 27, 2008, at 1:18 AM, climbingrose wrote:
Firstly, my apologies for being off topic. I'm asking this question because
I think there are some machine learning and text processing experts on this
mailing list.
Basically, my task is to normalize a fairly unstructured set of short texts
using a dictionary. We have a pre-defined list of produc