Hello all,

For our search system we'd like to be able to process merged tokens, i.e. when a user enters a query like "hotelsin barcelona", we'd like to know that the user means "hotels in barcelona".

At some point in the past we implemented this kind of functionality with shingles (using ShingleFilter). That is, if we indexed the sentence "hotels in barcelona" as a document, we could match merged tokens like "hotelsin" and "inbarcelona" at query time.

This solution has two problems:

1) The index size increases a lot.
2) We only catch a small fraction of the possibilities. Merged tokens whose parts are not adjacent in the indexed text, or appear in a different order, such as "hotelsbarcelona" or "barcelonahotels", cannot be processed.

Our intuition is that there should be a better solution. Maybe it's already solved in Solr or Lucene and we haven't found it yet.

If it's not solved, I can imagine a naive solution that would use TermsEnum to check whether a token exists in the index and, if it doesn't, use the TermsEnum again to check whether the token is a concatenation of two known tokens. It's highly likely that there are much better solutions and algorithms for this.

It would be great if you could help us identify the best way to solve this problem.

Thanks a lot for your help.

Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com
Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
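
P.S. To make the naive idea above concrete, here is a rough Python sketch of what I have in mind. It is only an illustration under assumptions: a plain in-memory set (KNOWN_TERMS) stands in for the index's term dictionary, where in Lucene each membership test would presumably be a TermsEnum lookup instead, and the function names and the max_parts limit are made up for this example.

```python
# Hypothetical stand-in for the term dictionary of the index.
KNOWN_TERMS = {"hotels", "in", "barcelona"}


def split_merged_token(token, known_terms, max_parts=3):
    """Return every way to split `token` into at most `max_parts` known terms."""
    results = []

    def extend(start, parts):
        if start == len(token):
            # The whole token has been consumed by known terms.
            results.append(list(parts))
            return
        if len(parts) == max_parts:
            return
        for end in range(start + 1, len(token) + 1):
            piece = token[start:end]
            if piece in known_terms:
                parts.append(piece)
                extend(end, parts)
                parts.pop()

    extend(0, [])
    return results


def expand_query(query, known_terms):
    """Replace each unknown token with its shortest split into known terms."""
    expanded = []
    for token in query.lower().split():
        if token in known_terms:
            expanded.append(token)
            continue
        splits = split_merged_token(token, known_terms)
        # Prefer the split with the fewest parts; keep the token if none exists.
        expanded.extend(min(splits, key=len) if splits else [token])
    return " ".join(expanded)


print(expand_query("hotelsin barcelona", KNOWN_TERMS))
print(expand_query("barcelonahotels", KNOWN_TERMS))
```

Unlike the shingle approach, this does not depend on the order or adjacency of the words in the indexed documents, so "hotelsbarcelona" and "barcelonahotels" are both split. The obvious weaknesses are that it tries every split point for every unknown token and that it has no way of ranking ambiguous splits other than preferring fewer parts.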