Yeah, I don't know if SecondString scales. Note that Lucene now has
an implementation of Jaro-Winkler, which is a pretty good distance
measure, so you may want to give that a try, plus if you see speedups,
feel free to contrib a patch ;-)
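For readers who haven't met the measure: Jaro-Winkler scores two strings between 0.0 and 1.0, rewarding shared characters that appear in roughly the same positions and boosting pairs that share a common prefix. The following is a minimal self-contained sketch of the measure itself, not Lucene's actual implementation:

```java
// Minimal sketch of Jaro-Winkler similarity (1.0 = identical strings).
// Illustrative only -- not Lucene's implementation.
public class JaroWinklerSketch {

    static double jaro(String a, String b) {
        if (a.equals(b)) return 1.0;
        int la = a.length(), lb = b.length();
        if (la == 0 || lb == 0) return 0.0;
        // Characters "match" if equal and within this window of each other.
        int window = Math.max(0, Math.max(la, lb) / 2 - 1);
        boolean[] ma = new boolean[la], mb = new boolean[lb];
        int matches = 0;
        for (int i = 0; i < la; i++) {
            int hi = Math.min(lb - 1, i + window);
            for (int j = Math.max(0, i - window); j <= hi; j++) {
                if (!mb[j] && a.charAt(i) == b.charAt(j)) {
                    ma[i] = true;
                    mb[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int halfTranspositions = 0, k = 0;
        for (int i = 0; i < la; i++) {
            if (!ma[i]) continue;
            while (!mb[k]) k++;
            if (a.charAt(i) != b.charAt(k)) halfTranspositions++;
            k++;
        }
        double m = matches, t = halfTranspositions / 2.0;
        return (m / la + m / lb + (m - t) / m) / 3.0;
    }

    // Winkler modification: boost pairs sharing a common prefix (up to 4 chars).
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        int prefix = 0, max = Math.min(4, Math.min(a.length(), b.length()));
        while (prefix < max && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA")); // ~0.961
        System.out.println(jaroWinkler("nokia n95 8gb", "nokia n95"));
    }
}
```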
I'm wondering if Hadoop couldn't help w/ the scale problem. Perhaps
you might ask over on the Mahout user list, too, as there are a fair
number of text geeks over there. Might be interesting to think about
a contribution in this department. Also Tom and I are likely to have
a chapter on this topic (and other "string" related issues) in "Taming
Text" (Manning), but that one isn't written yet (shameless plug, I
know, sorry!) Tom definitely knows more on this particular topic than
I do.
-Grant
On Jun 27, 2008, at 11:25 AM, climbingrose wrote:
Thanks Grant. I did try Second String before and found that it wasn't
particularly good for doing a lot of text matching. I'm leaning toward
a combination of Lucene and Second String. Googling around a bit, I
came across this project: http://datamining.anu.edu.au/projects/linkage.html.
It looks interesting, but the implementation is in Python. I think they
use a Hidden Markov Model to label training data and then match records
probabilistically.
On Fri, Jun 27, 2008 at 10:12 PM, Grant Ingersoll
<[EMAIL PROTECTED]>
wrote:
below
On Jun 27, 2008, at 1:18 AM, climbingrose wrote:
Firstly, my apologies for being off topic. I'm asking this question
because I think there are some machine learning and text processing
experts on this mailing list.
Basically, my task is to normalize a fairly unstructured set of short
texts using a dictionary. We have a pre-defined list of products and
periodically receive product feeds from various websites. Our site is
similar to a shopping comparison engine, but in a different domain. We
would like to normalize the product names in the feeds to our
pre-defined list.
For example:
"Nokia N95 8GB Black" ---> "Nokia N95 8GB"
"Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"
My original idea is to index the list of pre-defined names and then
query the index using the product's name. The highest-scoring result
will be used to normalize the product.
The problem with this is that sometimes you get wrong matches because
of noise. For example, "Black Nokia N95, 8GB + Free bluetooth headset"
can match "Nokia Bluetooth Headset" which is desirable.
I assume you mean "not desirable" here given the context...
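The failure mode described above is easy to reproduce with a naive token-overlap matcher, a crude stand-in for the "take the highest-scoring result" idea (illustrative code, not Lucene scoring):

```java
import java.util.*;

// Crude stand-in for "query the index and take the top hit": score each
// dictionary name by the fraction of its tokens present in the feed title.
// Illustrative only -- a real setup would use Lucene's scoring.
public class NaiveMatcher {

    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("[^a-z0-9]+")));
    }

    static String bestMatch(String feedTitle, List<String> dictionary) {
        Set<String> feed = tokens(feedTitle);
        String best = null;
        double bestScore = -1.0;
        for (String name : dictionary) {
            Set<String> cand = tokens(name);
            long hits = cand.stream().filter(feed::contains).count();
            double score = hits / (double) cand.size();
            if (score > bestScore) {
                bestScore = score;
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("Nokia Bluetooth Headset", "Nokia N95 8GB");
        // Every token of BOTH candidates occurs in the noisy title, so both
        // score 1.0 and the tie is broken arbitrarily by dictionary order --
        // here the accessory wins over the phone.
        System.out.println(bestMatch("Black Nokia N95, 8GB + Free bluetooth headset", dict));
        // prints "Nokia Bluetooth Headset"
    }
}
```

This is exactly why a pure bag-of-words match needs help from a string distance or other disambiguation, as discussed below in the thread.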
Your approach is worth trying. At a deeper level, you may want to look
into a topic called "record linkage" and an open source project called
Second String from William Cohen's group at Carnegie Mellon
(http://secondstring.sourceforge.net/), which has a whole bunch of
implementations of fuzzy string matching algorithms like Jaro-Winkler,
Levenshtein, etc. that can then be used to implement what you are
after.
You could potentially use the spell checking functionality to simulate
some of this a bit better than just a pure vector match. Index your
dictionary into a spelling index (see SOLR-572) and then send in spell
checking queries. In fact, you probably could integrate Second String
into the spell checker pretty easily, since one can now plug a
distance measure into the spell checker.
You may find some help on this by searching http://lucene.markmail.org
for things like "record linkage" or "record matching" or various other
related terms.
Another option is to write up a NormalizingTokenFilter that analyzes
the tokens as they come in to see if they match your dictionary list.
As with all of these, there is going to be some trial and error here
to come up with something that hits most of the time, as it will never
be perfect.
Good luck,
Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
--
Regards,
Cuong Hoang