On Jun 27, 2008, at 1:18 AM, climbingrose wrote:
Firstly, my apologies for being off topic. I'm asking this question because I think there are some machine learning and text processing experts on this mailing list.
Basically, my task is to normalize a fairly unstructured set of short texts using a dictionary. We have a pre-defined list of products and periodically receive product feeds from various websites. Our site is similar to a shopping comparison engine but on a different domain. We would like to normalize the products' names in the feeds using our pre-defined list.
For example:
"Nokia N95 8GB Black" ---> "Nokia N95 8GB"
"Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"
My original idea is to index the list of pre-defined names and then query the index using the product's name. The highest scored result will be used to normalize the product.
The problem with this is sometimes you get wrong matches because of noise. For example, "Black Nokia N95, 8GB + Free bluetooth headset" can match "Nokia Bluetooth Headset" which is desirable.
I assume you mean "not desirable" here given the context...
Your approach is worth trying. At a deeper level, you may want to look into a topic called "record linkage" and an open source project called Second String by William Cohen's group at Carnegie Mellon (http://secondstring.sourceforge.net/), which has a whole bunch of implementations of fuzzy string matching algorithms like Jaro-Winkler, Levenshtein, etc. that can then be used to implement what you are after.
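For instance, here is a rough sketch of brute-force matching a feed name against your dictionary with one of those measures, using the StringDistance implementations that ship with the Lucene spell checker (DictionaryMatcher and the threshold below are made up for illustration; exact package locations vary by version):

import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.StringDistance;

public class DictionaryMatcher {
  private final String[] dictionary;  // the pre-defined product names
  private final StringDistance distance = new JaroWinklerDistance();

  public DictionaryMatcher(String[] dictionary) {
    this.dictionary = dictionary;
  }

  // Returns the dictionary entry most similar to the raw feed name,
  // or null if nothing clears the threshold.
  public String normalize(String rawName, float threshold) {
    String best = null;
    float bestScore = threshold;
    for (String candidate : dictionary) {
      // getDistance() is a similarity in [0,1]; higher means closer
      float score = distance.getDistance(rawName, candidate);
      if (score > bestScore) {
        bestScore = score;
        best = candidate;
      }
    }
    return best;
  }
}

Something like new DictionaryMatcher(names).normalize("Black Nokia N95, 8GB + Free bluetooth headset", 0.8f) would then return the closest pre-defined name, or null if nothing is close enough. Brute force is fine for a few thousand names; beyond that you would want to cut the candidate set down first, e.g. with your index query.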
You could potentially use the spell checking functionality to simulate some of this a bit better than just a pure vector match. Index your dictionary into a spelling index (see SOLR-572) and then send in spell checking queries. In fact, you probably could integrate Second String into the spell checker pretty easily, since one can now plug the distance measure into the spell checker.
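To make that concrete, something along these lines (a sketch only: the SpellChecker method signatures, indexDictionary() in particular, have changed across Lucene versions, so check the version you are on):

import java.io.StringReader;
import org.apache.lucene.search.spell.LevensteinDistance;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ProductNameSuggester {
  public static void main(String[] args) throws Exception {
    Directory spellIndex = new RAMDirectory();
    SpellChecker checker = new SpellChecker(spellIndex);

    // The pluggable distance hook mentioned above; a Second String
    // measure could be wrapped in a StringDistance and set here.
    // (The class name really is spelled "Levenstein" in Lucene.)
    checker.setStringDistance(new LevensteinDistance());

    // Index the pre-defined product names, one per line.
    checker.indexDictionary(new PlainTextDictionary(
        new StringReader("Nokia N95 8GB\nNokia Bluetooth Headset\n")));

    // Ask for the dictionary entries closest to the raw feed name.
    String[] suggestions = checker.suggestSimilar("Nokia N95 8GB Black", 3);
    for (String s : suggestions) {
      System.out.println(s);
    }
    checker.close();
  }
}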
You may find some help on this by searching http://lucene.markmail.org for things like "record linkage" or "record matching" or various other related terms.
Another option is to write up a NormalizingTokenFilter that analyzes the tokens as they come in to see if they match your dictionary list.
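A skeleton of that might look like the following (NormalizingTokenFilter is not an existing Lucene class, this uses the newer attribute-based TokenStream API, and a real version would need to buffer tokens to match multi-word product names rather than single terms):

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Rewrites tokens that have a canonical form in the dictionary.
public final class NormalizingTokenFilter extends TokenFilter {
  private final Map<String, String> dictionary;  // raw term -> canonical term
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public NormalizingTokenFilter(TokenStream input, Map<String, String> dictionary) {
    super(input);
    this.dictionary = dictionary;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String canonical = dictionary.get(termAtt.toString());
    if (canonical != null) {
      termAtt.setEmpty().append(canonical);  // replace with the normalized form
    }
    return true;
  }
}

You would slot it into your Analyzer chain after the tokenizer; for whole product names it may be simpler to normalize the full field value, as in the sketches above, than to do it token by token.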
As with all of these, there is going to be some trial and error here to come up with something that hits most of the time, as it will never be perfect.
Good luck,
Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ