problems with DisjunctionMaxQuery and early-termination

Carlos Gonzalez-Cadenas Thu, 15 Mar 2012 23:10:32 -0700

Hello all,

We have a SOLR index filled with user queries and we want to retrieve the
ones that are most similar to a given query entered by an end-user. It is
a kind of "related queries" system.


The index is pretty big, so we're using early termination of queries (with
the index sorted so that the "more popular" queries have lower docids and
therefore the termination yields higher-quality results).

Clearly, when the user enters a user-level query into the search box, e.g.
"cheap hotels barcelona offers", we don't know whether there exists a
document (query) in the index that contains these four words or not.
Therefore, when we're building the SOLR query, the first intuition would
be to do a query like this: "cheap OR hotels OR barcelona OR offers".

If all the documents in the index were evaluated, the results of this
query would be good. For example, if there is no query in the index with
these four words but there's a query in the index with the text "cheap
hotels barcelona", it will probably be one of the top results, which is
precisely what we want.

The problem is that we're doing early termination and therefore this query
will exhaust the top-K result limit very quickly (our custom collector
limits the number of evaluated documents), given that queries like "hotels
in madrid" or "hotels in NYC" will match the OR expression described above
(because they all match "hotels").
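
To make the failure mode concrete, here is a minimal simulation in plain
Python (not SOLR or Lucene code). The toy index, the evaluate_or helper and
the budget of 3 are all made up for illustration. Position in the list
stands in for the docid, so lower positions are "more popular".

```python
# Toy index: list position is the docid, lower docid = more popular.
# All entries are invented for illustration.
INDEX = [
    "hotels in madrid",
    "hotels in nyc",
    "hotels london",
    "cheap flights",
    "cheap hotels barcelona",
]


def evaluate_or(terms, index, max_collected):
    """Scan in docid order, collect any doc sharing a term, stop at the budget."""
    wanted = set(terms)
    collected = []
    for docid, text in enumerate(index):
        if wanted & set(text.split()):
            collected.append((docid, text))
            if len(collected) >= max_collected:
                break  # early termination: later docids are never seen
    return collected


hits = evaluate_or(["cheap", "hotels", "barcelona", "offers"], INDEX, max_collected=3)
for docid, text in hits:
    print(docid, text)
```

In this toy setup the budget is used up by documents that only match
"hotels", and the scan stops before it reaches "cheap hotels barcelona".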

Our next step was to try a DisjunctionMaxQuery, writing a
query like this:

DisjunctionMaxQuery:
 1) +cheap +hotels +barcelona +offers
 2) +cheap +hotels +barcelona
 3) +cheap +hotels
 4) +hotels

We were thinking that perhaps the sub-queries within the
DisjunctionMaxQuery were going to get evaluated in "parallel", given that
they're separate queries, but in fact from a runtime perspective it
behaves much like the OR query that we described above.

Our desired behavior is to try to match documents with each sub-query
within the DisjunctionMaxQuery (up to a per-sub-query limit that we set)
and then score them and return them all together (so we don't want all
the matches to come from a single sub-query, as is happening now).
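
The behavior we're after could be sketched roughly like this, again as a
plain Python simulation rather than Lucene code. The collect_per_tier
helper, the toy index, the per-tier limit of 2 and the scoring rule
(number of required terms in the tier) are all assumptions for
illustration only. Keeping the maximum score per document mirrors what a
DisjunctionMaxQuery does with a tie-breaker of zero.

```python
# Toy index: list position is the docid, lower docid = more popular.
INDEX = [
    "hotels in madrid",
    "hotels in nyc",
    "hotels london",
    "cheap hotels paris",
    "cheap flights",
    "cheap hotels barcelona",
]

# The four sub-queries from the message; every term in a tier is required.
TIERS = [
    ["cheap", "hotels", "barcelona", "offers"],
    ["cheap", "hotels", "barcelona"],
    ["cheap", "hotels"],
    ["hotels"],
]


def collect_per_tier(tiers, index, per_tier_limit):
    """Give each tier its own match budget, then merge and rank the union."""
    best = {}  # docid -> best score seen across tiers
    for tier in tiers:
        required = set(tier)
        taken = 0
        for docid, text in enumerate(index):
            if required <= set(text.split()):
                score = len(required)  # hypothetical score: stricter tier wins
                best[docid] = max(best.get(docid, 0), score)
                taken += 1
                if taken >= per_tier_limit:
                    break  # this tier's budget is spent, move to the next
    # Highest score first, ties broken by popularity (lower docid).
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


results = collect_per_tier(TIERS, INDEX, per_tier_limit=2)
for docid, score in results:
    print(score, INDEX[docid])
```

With separate budgets, the "hotels"-only tier can no longer crowd out the
stricter tiers, so "cheap hotels barcelona" ranks first in this toy
example even though it has the highest docid.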

Clearly, we could create a script external to SOLR that just runs the
several sub-queries as standalone queries and then joins all the results
together, but before going for this we'd like to know if you have any ideas
on how to solve this problem within SOLR. We do have our own QParser, and
therefore we'd be able to implement any arbitrary query construction that
you can come up with, or even create a new Query type if it's needed.

Thanks a lot for your help,
Carlos


Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
