Thanks Grant. I did try SecondString before and found that it wasn't particularly good at doing large volumes of text matching. I'm leaning toward the combination of Lucene and SecondString. Googling around a bit, I came across this project: http://datamining.anu.edu.au/projects/linkage.html. It looks interesting, though the implementation is in Python. I think they use a Hidden Markov Model to label training data and then match records probabilistically.
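Roughly, here's how I picture wiring the two together: Lucene narrows the dictionary down to a handful of candidates, then SecondString re-scores them against the raw feed title. This is just a sketch; the ProductNormalizer class, the "name" field, and the 0.85 cutoff are my own inventions for illustration, and I'm going from SecondString's JaroWinkler/score() API as I understand it:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import com.wcohen.ss.JaroWinkler;

// Lucene retrieves candidate names; SecondString's Jaro-Winkler picks the winner.
public class ProductNormalizer {
  private final IndexSearcher searcher;                 // index over the pre-defined names
  private final JaroWinkler distance = new JaroWinkler();

  public ProductNormalizer(IndexSearcher searcher) {
    this.searcher = searcher;
  }

  // Returns the canonical name, or null if nothing scores above the cutoff.
  public String normalize(String feedTitle) throws Exception {
    QueryParser parser = new QueryParser("name", new StandardAnalyzer());
    TopDocs top = searcher.search(parser.parse(QueryParser.escape(feedTitle)), null, 10);
    String best = null;
    double bestScore = 0.0;
    for (ScoreDoc sd : top.scoreDocs) {
      String candidate = searcher.doc(sd.doc).get("name");
      double score = distance.score(feedTitle, candidate); // higher = more similar
      if (score > bestScore) {
        bestScore = score;
        best = candidate;
      }
    }
    return bestScore > 0.85 ? best : null; // 0.85 is a guess; needs tuning on real feeds
  }
}

That way the expensive pairwise comparison only runs over the top 10 Lucene hits rather than the whole dictionary. I suspect a token-level measure like SecondString's SoftTFIDF might cope better than plain Jaro-Winkler with reordered titles, but that's something to test.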
On Fri, Jun 27, 2008 at 10:12 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> below
>
> On Jun 27, 2008, at 1:18 AM, climbingrose wrote:
>
>> Firstly, my apologies for being off topic. I'm asking this question
>> because I think there are some machine learning and text processing
>> experts on this mailing list.
>>
>> Basically, my task is to normalize a fairly unstructured set of short
>> texts using a dictionary. We have a pre-defined list of products and
>> periodically receive product feeds from various websites. Basically,
>> our site is similar to a shopping comparison engine but on a different
>> domain. We would like to normalize the products' names in the feeds
>> using our pre-defined list. For example:
>>
>> "Nokia N95 8GB Black" ---> "Nokia N95 8GB"
>> "Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"
>>
>> My original idea is to index the list of pre-defined names and then
>> query the index using the product's name. The highest-scored result
>> will be used to normalize the product.
>>
>> The problem with this is that sometimes you get wrong matches because
>> of noise. For example, "Black Nokia N95, 8GB + Free bluetooth headset"
>> can match "Nokia Bluetooth Headset" which is desirable.
>
> I assume you mean "not desirable" here given the context...
>
> Your approach is worth trying. At a deeper level, you may want to look
> into a topic called "record linkage" and an open source project called
> Second String by William Cohen's group at Carnegie Mellon
> (http://secondstring.sourceforge.net/), which has a whole bunch of
> implementations of fuzzy string matching algorithms like Jaro-Winkler,
> Levenshtein, etc. that can then be used to implement what you are after.
>
> You could potentially use the spell checking functionality to simulate
> some of this a bit better than just a pure vector match. Index your
> dictionary into a spelling index (see SOLR-572) and then send in spell
> checking queries. In fact, you probably could integrate Second String
> into the spell checker pretty easily, since one can now plug the
> distance measure into the spell checker.
>
> You may find some help on this by searching http://lucene.markmail.org
> for things like "record linkage" or "record matching" or various other
> related terms.
>
> Another option is to write up a NormalizingTokenFilter that analyzes
> the tokens as they come in to see if they match your dictionary list.
>
> As with all of these, there is going to be some trial and error here to
> come up with something that hits most of the time, as it will never be
> perfect.
>
> Good luck,
> Grant
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ

--
Regards,
Cuong Hoang
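P.S. Grant, just to check I follow the NormalizingTokenFilter suggestion, is this roughly what you had in mind? A minimal sketch against the Lucene 2.3-era Token API; the Map is only a stand-in for however the dictionary actually gets loaded:

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Replaces each incoming token with its canonical form from the dictionary, if any.
public class NormalizingTokenFilter extends TokenFilter {
  private final Map<String, String> dictionary; // raw term -> canonical term

  public NormalizingTokenFilter(TokenStream in, Map<String, String> dictionary) {
    super(in);
    this.dictionary = dictionary;
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t == null) {
      return null;
    }
    String canonical = dictionary.get(t.termText());
    if (canonical != null) {
      t.setTermText(canonical); // e.g. map a spelling variant to its canonical term
    }
    return t;
  }
}

I realize this only canonicalizes one token at a time, so it would catch spelling variants but not multi-word rearrangements; the candidate matching above would still have to do the heavy lifting.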