Firstly, my apologies for being off topic. I'm asking this question because
I think there are some machine learning and text processing experts on this
mailing list.

My task is to normalize a fairly unstructured set of short texts using a
dictionary. We have a pre-defined list of products and periodically receive
product feeds from various websites. Our site is similar to a shopping
comparison engine, but in a different domain. We would like to normalize
the product names in the feeds using our pre-defined list.
For example:

"Nokia N95 8GB Black" --> "Nokia N95 8GB"
"Black Nokia N95, 8GB + Free bluetooth headset" --> "Nokia N95 8GB"

My original idea is to index the list of pre-defined names and then query
the index with each product name from the feed. The highest-scoring result
is then used as the normalized name for that product.
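
To make the idea concrete, here is a minimal sketch of what I mean. It is
not our real code: the three-entry catalog, the function names, and the
IDF-weighted token overlap score are all assumptions made up for
illustration, standing in for a real search index.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(names):
    # Inverted index: token -> set of positions in `names` containing it.
    index = defaultdict(set)
    for i, name in enumerate(names):
        for tok in tokenize(name):
            index[tok].add(i)
    return index


def best_match(query, names, index):
    # Score each candidate by summing an IDF-style weight for every
    # query token it contains, so rarer tokens count for more.
    scores = defaultdict(float)
    n = len(names)
    for tok in set(tokenize(query)):
        postings = index.get(tok, ())
        if not postings:
            continue  # token does not occur in any pre-defined name
        idf = math.log(1 + n / len(postings))
        for i in postings:
            scores[i] += idf
    if not scores:
        return None  # nothing in the list shares a token with the query
    # Highest score wins; ties go to the earlier entry in the list.
    top = max(scores, key=lambda i: (scores[i], -i))
    return names[top]


# Hypothetical pre-defined product list.
catalog = ["Nokia N95 8GB", "Nokia N95", "Nokia Bluetooth Headset"]
index = build_index(catalog)

print(best_match("Nokia N95 8GB Black", catalog, index))
```

A real setup would use a proper full-text index rather than this toy
scorer, but the flow is the same: index once, then query per feed item.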

The problem with this approach is that noise in the feed text sometimes
produces wrong matches. For example, "Black Nokia N95, 8GB + Free bluetooth
headset" can match "Nokia Bluetooth Headset", which is undesirable.

Is there a better solution for this problem? Thanks in advance.

-- 
Regards,

Cuong Hoang
