processing of merged tokens

Carlos Gonzalez-Cadenas Mon, 20 Feb 2012 01:42:42 -0800

Hello all,

For our search system we'd like to be able to process merged tokens, i.e.
when a user enters a query like "hotelsin barcelona", we'd like to know
that the user means "hotels in barcelona".


We previously implemented this kind of functionality with shingles (using
ShingleFilter): when indexing the sentence "hotels in barcelona" as a
document, we were able to match merged tokens like "hotelsin" and
"inbarcelona" at query time.

This solution has two problems:
1) The index size increases considerably.
2) We only catch a small fraction of the possibilities. Merged tokens built
from words that are not adjacent, or not in the indexed order, such as
"hotelsbarcelona" or "barcelonahotels", cannot be processed.
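To make the second problem concrete, here is a minimal sketch in plain Python
that emulates the merged tokens an adjacent-word shingle approach would
produce. It is not ShingleFilter itself; the function name adjacent_merges is
hypothetical and only illustrates the idea.

```python
from typing import List


def adjacent_merges(tokens: List[str]) -> List[str]:
    """Concatenate each pair of adjacent tokens, in document order.

    This only emulates two-word shingles joined without a separator;
    it is an illustration, not the Lucene ShingleFilter.
    """
    return [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]


merged = adjacent_merges(["hotels", "in", "barcelona"])
# Adjacent pairs are covered ...
print("hotelsin" in merged, "inbarcelona" in merged)
# ... but non-adjacent or reordered pairs are not.
print("hotelsbarcelona" in merged, "barcelonahotels" in merged)
```

Covering the missing cases with shingles alone would mean indexing every
ordered pair of words in a document, which would make the index size problem
worse.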

Our intuition is that there should be a better solution. Maybe it's already
solved in Solr or Lucene and we haven't found it yet. If it's not, I can
imagine a naive solution that would use TermsEnum to check whether a token
exists in the index and, if it doesn't, use the TermsEnum again to check
whether it's a composition of two known tokens.
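The naive idea above can be sketched as follows. This is only an illustration
under stated assumptions: a plain Python set stands in for the index's term
dictionary (in Lucene the membership test would be a TermsEnum lookup), and
the names split_merged_token and min_len are hypothetical.

```python
from typing import Iterable, List, Tuple


def split_merged_token(
    token: str, known_terms: Iterable[str], min_len: int = 2
) -> List[Tuple[str, str]]:
    """Return every way to split `token` into two known terms.

    `known_terms` is a stand-in for the index term dictionary.
    `min_len` is an assumed lower bound on the length of each part.
    """
    terms = set(known_terms)
    if token in terms:
        # The token already exists in the index, so nothing to split.
        return []
    splits = []
    # Try every split point that leaves at least `min_len` chars per side.
    for i in range(min_len, len(token) - min_len + 1):
        left, right = token[:i], token[i:]
        if left in terms and right in terms:
            splits.append((left, right))
    return splits


index_terms = {"hotels", "in", "barcelona"}
print(split_merged_token("hotelsin", index_terms))
print(split_merged_token("barcelonahotels", index_terms))
```

Unlike the shingle approach, this does not depend on word order or adjacency
in any document, and it adds nothing to the index. It does cost one lookup
per candidate split point, and a token may split in more than one valid way,
so the caller would need some way to rank the candidates.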

It's highly likely that there are much better solutions and algorithms for
this. It would be great if you could help us identify the best way to solve
this problem.

Thanks a lot for your help.

Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
