Hello all,

We have a SOLR index filled with user queries, and we want to retrieve the ones that are most similar to a given query entered by an end user. It is a kind of "related queries" system.
The index is pretty big and we're using early termination of queries (the index is sorted so that the "more popular" queries have lower docids, and therefore terminating early still yields the higher-quality results).

Clearly, when the user enters a user-level query into the search box, e.g. "cheap hotels barcelona offers", we don't know whether a document (query) containing all four words exists in the index. Therefore, when we build the SOLR query, the first intuition is to issue something like "cheap OR hotels OR barcelona OR offers". If all the documents in the index were evaluated, the results of this query would be good. For example, if there is no query in the index with these four words but there is one with the text "cheap hotels barcelona", it will probably be one of the top results, which is precisely what we want.

The problem is that we're doing early termination, so this query exhausts the top-K result limit very quickly (our custom collector limits the number of evaluated documents), because queries like "hotels in madrid" or "hotels in NYC" also match the OR expression above (they all match "hotels").

Our next step was to try a DisjunctionMaxQuery, writing a query like this:

DisjunctionMaxQuery:
1) +cheap +hotels +barcelona +offers
2) +cheap +hotels +barcelona
3) +cheap +hotels
4) +hotels

We thought that the sub-queries within the DisjunctionMaxQuery might be evaluated in "parallel", given that they are separate queries, but from a runtime perspective it behaves much like the OR query described above.

Our desired behavior is to match documents against each sub-query within the DisjunctionMaxQuery (up to a per-sub-query limit that we set), and then score them and return them all together. In other words, we don't want all the matches to come from a single sub-query, as is happening now.
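To make the desired behavior concrete, here is a minimal Python sketch. It is not SOLR or Lucene code and uses no SOLR API: the toy index, the function names and the scoring rule (more matched terms first, then lower docid) are all assumptions made for illustration, and the limit is assumed to count collected matches. It contrasts the early-terminated OR query with a version that gives each sub-query its own budget.

```python
# Toy index, assumed sorted by popularity: list position == docid.
INDEX = [
    "hotels in madrid",
    "hotels in nyc",
    "cheap hotels",
    "hotels barcelona",
    "cheap hotels barcelona",
    "cheap flights barcelona",
]

# The four conjunctive sub-queries from the DisjunctionMaxQuery above.
SUBQUERIES = [
    ["cheap", "hotels", "barcelona", "offers"],
    ["cheap", "hotels", "barcelona"],
    ["cheap", "hotels"],
    ["hotels"],
]


def or_query_early_terminated(terms, index, limit):
    """Collect the first `limit` docids matching ANY term, in docid order."""
    wanted = set(terms)
    hits = []
    for docid, text in enumerate(index):
        if wanted & set(text.split()):
            hits.append(docid)
            if len(hits) == limit:
                break  # early termination: the budget is used up
    return hits


def per_subquery_search(subqueries, index, per_limit):
    """Run each conjunctive sub-query with its own budget, then merge.

    A document keeps the size of the most specific sub-query it was
    collected by; results are ordered by that size (descending), then
    by docid (ascending, i.e. more popular first).
    """
    best = {}  # docid -> number of terms in the best matching sub-query
    for sub in subqueries:
        needed = set(sub)
        found = 0
        for docid, text in enumerate(index):
            if needed <= set(text.split()):
                found += 1
                if len(sub) > best.get(docid, 0):
                    best[docid] = len(sub)
                if found == per_limit:
                    break  # this sub-query's budget is used up
    return sorted(best, key=lambda d: (-best[d], d))


if __name__ == "__main__":
    user_terms = ["cheap", "hotels", "barcelona", "offers"]
    print(or_query_early_terminated(user_terms, INDEX, limit=2))
    print(per_subquery_search(SUBQUERIES, INDEX, per_limit=2))
```

With a budget of 2, the OR query stops after the two popular "hotels in ..." documents, while the per-sub-query version still reaches "cheap hotels barcelona" and ranks it first. Inside SOLR the same idea would need a custom Query or collector that tracks one budget per sub-query; the sketch only pins down the semantics we are after.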
Clearly, we could create a script external to SOLR that runs the sub-queries as standalone queries and then joins all the results together, but before going for this we'd like to know if you have any ideas on how to solve this problem within SOLR. We do have our own QParser, so we'd be able to implement any arbitrary query construction that you can come up with, or even create a new Query type if needed.

Thanks a lot for your help,
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com
Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas