On Jun 27, 2008, at 1:18 AM, climbingrose wrote:
Firstly, my apologies for being off topic. I'm asking this question because I think there are some machine learning and text processing experts on this mailing list.
Basically, my task is to normalize a fairly unstructured set of short texts using a dictionary. We have a pre-defined list of products and periodically receive product feeds from various websites. Our site is similar to a shopping comparison engine but on a different domain. We would like to normalize the products' names in the feeds using our pre-defined list.
For example:
"Nokia N95 8GB Black" ---> "Nokia N95 8GB"
"Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"
My original idea is to index the list of pre-defined names and then query the index using the product's name. The highest scored result will be used to normalize the product.
The problem with this is sometimes you get wrong matches because of noise. For example, "Black Nokia N95, 8GB + Free bluetooth headset" can match "Nokia Bluetooth Headset" which is desirable.
I assume you mean "not desirable" here given the context...
Your approach is worth trying. At a deeper level, you may want to look into a topic called "record linkage" and an open source project called Second String by William Cohen's group at Carnegie Mellon (http://secondstring.sourceforge.net/), which has a whole bunch of implementations of fuzzy string matching algorithms like Jaro-Winkler, Levenshtein, etc. that can then be used to implement what you are after.
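For instance, here is a rough sketch of brute-force matching a feed name against your dictionary with one of those measures, using the StringDistance implementations that ship with the Lucene spell checker (DictionaryMatcher and the threshold below are made up for illustration; exact package locations vary by version):

import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.StringDistance;

public class DictionaryMatcher {
  private final String[] dictionary;  // the pre-defined product names
  private final StringDistance distance = new JaroWinklerDistance();

  public DictionaryMatcher(String[] dictionary) {
    this.dictionary = dictionary;
  }

  // Returns the dictionary entry most similar to the raw feed name,
  // or null if nothing clears the threshold.
  public String normalize(String rawName, float threshold) {
    String best = null;
    float bestScore = threshold;
    for (String candidate : dictionary) {
      // getDistance() is a similarity in [0,1]; higher means closer
      float score = distance.getDistance(rawName, candidate);
      if (score > bestScore) {
        bestScore = score;
        best = candidate;
      }
    }
    return best;
  }
}

Something like new DictionaryMatcher(names).normalize("Black Nokia N95, 8GB + Free bluetooth headset", 0.8f) would then return the closest pre-defined name, or null if nothing is close enough. Brute force is fine for a few thousand names; beyond that you would want to cut the candidate set down first, e.g. with your index query.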
You could potentially use the spell checking functionality to simulate some of this a bit better than just a pure vector match. Index your dictionary into a spelling index (see SOLR-572) and then send in spell checking queries. In fact, you probably could integrate Second String into the spell checker pretty easily, since one can now plug the distance measure into the spell checker.
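To make that concrete, something along these lines (a sketch only: the SpellChecker method signatures, indexDictionary() in particular, have changed across Lucene versions, so check the version you are on):

import java.io.StringReader;
import org.apache.lucene.search.spell.LevensteinDistance;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ProductNameSuggester {
  public static void main(String[] args) throws Exception {
    Directory spellIndex = new RAMDirectory();
    SpellChecker checker = new SpellChecker(spellIndex);

    // The pluggable distance hook mentioned above; a Second String
    // measure could be wrapped in a StringDistance and set here.
    // (The class name really is spelled "Levenstein" in Lucene.)
    checker.setStringDistance(new LevensteinDistance());

    // Index the pre-defined product names, one per line.
    checker.indexDictionary(new PlainTextDictionary(
        new StringReader("Nokia N95 8GB\nNokia Bluetooth Headset\n")));

    // Ask for the dictionary entries closest to the raw feed name.
    String[] suggestions = checker.suggestSimilar("Nokia N95 8GB Black", 3);
    for (String s : suggestions) {
      System.out.println(s);
    }
    checker.close();
  }
}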
You may find some help on this by searching http://lucene.markmail.org for things like "record linkage" or "record matching" or various other related terms.
Another option is to write up a NormalizingTokenFilter that analyzes the tokens as they come in to see if they match your dictionary list.
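A skeleton of that might look like the following (NormalizingTokenFilter is not an existing Lucene class, this uses the newer attribute-based TokenStream API, and a real version would need to buffer tokens to match multi-word product names rather than single terms):

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Rewrites tokens that have a canonical form in the dictionary.
public final class NormalizingTokenFilter extends TokenFilter {
  private final Map<String, String> dictionary;  // raw term -> canonical term
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public NormalizingTokenFilter(TokenStream input, Map<String, String> dictionary) {
    super(input);
    this.dictionary = dictionary;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String canonical = dictionary.get(termAtt.toString());
    if (canonical != null) {
      termAtt.setEmpty().append(canonical);  // replace with the normalized form
    }
    return true;
  }
}

You would slot it into your Analyzer chain after the tokenizer; for whole product names it may be simpler to normalize the full field value, as in the sketches above, than to do it token by token.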
As with all of these, there is going to be some trial and error here to come up with something that hits most of the time, as it will never be perfect.
Good luck,
Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ