Thanks Grant. I did try SecondString before and found that it wasn't
particularly good for doing a lot of text matching. I'm leaning toward a
combination of Lucene and SecondString. Googling around a bit, I came
across this project: http://datamining.anu.edu.au/projects/linkage.html.
It looks interesting, though the implementation is in Python. I think
they use a Hidden Markov Model to label training data and then match
records probabilistically.

On Fri, Jun 27, 2008 at 10:12 PM, Grant Ingersoll <[EMAIL PROTECTED]>
wrote:

> below
>
> On Jun 27, 2008, at 1:18 AM, climbingrose wrote:
>
>> Firstly, my apologies for being off topic. I'm asking this question
>> because I think there are some machine learning and text processing
>> experts on this mailing list.
>>
>> Basically, my task is to normalize a fairly unstructured set of short
>> texts using a dictionary. We have a pre-defined list of products and
>> periodically receive product feeds from various websites. Our site is
>> similar to a shopping comparison engine, but in a different domain. We
>> would like to normalize the product names in the feeds using our
>> pre-defined list. For example:
>>
>> "Nokia N95 8GB Black" ---> "Nokia N95 8GB"
>> "Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"
>>
>> My original idea is to index the list of pre-defined names and then
>> query the index using the product's name. The highest-scored result
>> will be used to normalize the product.
>>
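>> In code, the idea is roughly the following (just an untested sketch;
>> "dir", "preDefinedNames", the "name" field, and the analyzer choice are
>> all placeholders, and the exact Lucene signatures vary by version):
>>
>>   // Index the pre-defined product names once.
>>   IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
>>   for (String name : preDefinedNames) {
>>       Document doc = new Document();
>>       doc.add(new Field("name", name, Field.Store.YES, Field.Index.TOKENIZED));
>>       writer.addDocument(doc);
>>   }
>>   writer.close();
>>
>>   // Query with the raw feed title; the top-scoring hit becomes the
>>   // normalized name.
>>   IndexSearcher searcher = new IndexSearcher(dir);
>>   QueryParser parser = new QueryParser("name", new StandardAnalyzer());
>>   Query q = parser.parse(QueryParser.escape(
>>           "Black Nokia N95, 8GB + Free bluetooth headset"));
>>   TopDocs hits = searcher.search(q, null, 1);
>>   String normalized = hits.scoreDocs.length > 0
>>           ? searcher.doc(hits.scoreDocs[0].doc).get("name")
>>           : null;
>>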
>> The problem with this is that sometimes you get wrong matches because of
>> noise.
>> For example, "Black Nokia N95, 8GB + Free bluetooth headset" can match
>> "Nokia Bluetooth Headset" which is desirable.
>>
>
> I assume you mean "not desirable" here given the context...
>
> Your approach is worth trying.  At a deeper level, you may want to look
> into a topic called "record linkage" and an open source project called
> Second String by William Cohen's group at Carnegie Mellon (
> http://secondstring.sourceforge.net/), which has a whole bunch of
> implementations of fuzzy string matching algorithms like Jaro-Winkler,
> Levenshtein, etc. that can then be used to implement what you are after.
>
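> For what it's worth, dropping one of those distances into a matching
> loop is only a few lines.  A hedged sketch (the class and method names
> here are from memory, so treat them as assumptions):
>
>   import com.wcohen.ss.JaroWinkler;
>   import java.util.List;
>
>   // Return the dictionary entry most similar to the feed title.
>   // Assumes a score(String, String) method where higher = more similar.
>   String normalize(String feedTitle, List<String> dictionary) {
>       JaroWinkler jw = new JaroWinkler();
>       String best = null;
>       double bestScore = Double.NEGATIVE_INFINITY;
>       for (String candidate : dictionary) {
>           double s = jw.score(feedTitle, candidate);
>           if (s > bestScore) {
>               bestScore = s;
>               best = candidate;
>           }
>       }
>       return best;  // in practice, also require a minimum score cutoff
>   }
>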
> You could potentially use the spell checking functionality to simulate
> some of this a bit better than just a pure vector match.  Index your
> dictionary into a spelling index (see SOLR-572) and then send in spell
> checking queries.  In fact, you could probably integrate Second String
> into the spell checker pretty easily, since one can now plug in the
> distance measure.
>
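> A rough sketch of the spell checking route (untested, and the exact
> signatures vary across Lucene releases, so check them against your
> version):
>
>   // products.txt: one pre-defined product name per line.
>   SpellChecker spell = new SpellChecker(FSDirectory.getDirectory("spellidx"));
>   spell.indexDictionary(new PlainTextDictionary(new File("products.txt")));
>
>   // Ask for the closest dictionary entries to the raw feed title.
>   String[] suggestions = spell.suggestSimilar("Nokia N95 8GB Black", 5);
>
>   // With the pluggable distance support, a Second String measure could
>   // be adapted to the spell checker's StringDistance interface here.
>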
> You may find some help on this by searching http://lucene.markmail.org for
> things like "record linkage" or "record matching" or various other related
> terms.
>
> Another option is to write up a NormalizingTokenFilter that analyzes the
> tokens as they come in to see if they match your dictionary list.
>
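> Something along those lines, sketched against the 2.4-ish TokenFilter
> API (the filter name is made up, method names vary a bit between
> releases, and a real version would need to handle multi-word names,
> e.g. with shingles):
>
>   import java.io.IOException;
>   import java.util.Map;
>   import org.apache.lucene.analysis.Token;
>   import org.apache.lucene.analysis.TokenFilter;
>   import org.apache.lucene.analysis.TokenStream;
>
>   public class NormalizingTokenFilter extends TokenFilter {
>       private final Map<String, String> dictionary;  // variant -> canonical
>
>       public NormalizingTokenFilter(TokenStream in, Map<String, String> dict) {
>           super(in);
>           this.dictionary = dict;
>       }
>
>       public Token next(Token reusableToken) throws IOException {
>           Token token = input.next(reusableToken);
>           if (token == null) return null;
>           String canonical = dictionary.get(token.term());
>           if (canonical != null) {
>               token.setTermBuffer(canonical);  // rewrite to dictionary form
>           }
>           return token;
>       }
>   }
>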
> As with all of these, there is going to be some trial and error here to
> come up with something that hits most of the time, as it will never be
> perfect.
>
> Good luck,
> Grant
>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>


-- 
Regards,

Cuong Hoang
