jpountz commented on PR #12628:
URL: https://github.com/apache/lucene/pull/12628#issuecomment-1752152301
I'll try to give a bit more context on how I ended up here. With the recent work on vector search and the excitement around it, I can't help thinking that all users who are happy to trade some effectiveness for better efficiency on their vector searches should be happy to do the same with their term-based searches. There is some literature on rank-unsafe evaluation of top hits; the main one I'm familiar with is the original [WAND paper](https://www.researchgate.net/profile/David-Carmel-3/publication/221613425_Efficient_query_evaluation_using_a_two-level_retrieval_process/links/02bfe50e6854500153000000/Efficient-query-evaluation-using-a-two-level-retrieval-process.pdf), whose authors suggest configuring a threshold factor F and running the WAND operator so that it only considers matches whose sum of maximum scores across query terms is at least F times the k-th best score in the priority queue. When F is equal to 1 or less, the returned top hits are the same as those produced by exhaustive evaluation. Setting F greater than 1 trades effectiveness for efficiency. The authors share experiments that show how F=2 affects P@10 and MAP, as well as the number of evaluated hits.

I'd like to expose this kind of trade-off in Lucene. One problem I have with the above approach is that it is very WAND-specific. A similar idea that wouldn't be WAND-specific would be to configure the minimum competitive score (`Scorer#setMinimumCompetitiveScore`) as the k-th best score in the heap times some factor. This feels a bit too low-level for something to be exposed as an option on `IndexSearcher`, so I ended up trying this idea of quantizing scores, which is again very similar in terms of why it helps, and which I find easier to explain: it computes accurate top hits on quantized scores.
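To make the quantization idea concrete, here is a minimal sketch of one way to quantize a score: keep only the top `accuracyBits` bits of the float mantissa and round the rest away. The class name, the `accuracyBits` parameter name, and the masking strategy are illustrative assumptions, not necessarily what the PR implements:

```java
// Illustrative sketch (not the PR's exact code): quantize a non-negative
// score by zeroing all but the top `accuracyBits` bits of the float
// mantissa. Always rounding down guarantees that a quantized score never
// exceeds the original score, and nearby scores collapse into ties, which
// gives block-skipping logic more opportunities to prune.
public final class ScoreQuantizer {

  private final int mask;

  public ScoreQuantizer(int accuracyBits) {
    if (accuracyBits < 1 || accuracyBits > 23) {
      throw new IllegalArgumentException("accuracyBits must be in [1, 23]");
    }
    // A float has a 23-bit mantissa; zero out the low (23 - accuracyBits) bits.
    this.mask = -1 << (23 - accuracyBits);
  }

  /** Round the given non-negative score down onto the quantized grid. */
  public float quantizeDown(float score) {
    return Float.intBitsToFloat(Float.floatToIntBits(score) & mask);
  }
}
```

With `accuracyBits=5`, scores that fall within the same power of two and differ by less than roughly 3% (2^-5 of the leading binade) collapse to the same quantized value.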
To buy some time to think about how to expose it on `IndexSearcher`, I'm starting with this collector wrapper, which doesn't require changes to lucene-core and would help put it in the hands of users who are interested in trying it out. For reference, https://github.com/apache/lucene/pull/12446 is another PR where I'm exploring exposing rank-unsafe optimizations in Lucene.

> Would the typical usage of this scorer be as an initial collection of k docs, which would then get re-scored over their full floating point scores?

If someone wanted to do that, they could do it in a single pass by using a collector wrapper similar to the one I'm adding, only quantizing the scores passed to `Scorer#setMinimumCompetitiveScore`, not the ones returned by `Scorer#score`. I think the idea is rather for it to be used as a drop-in replacement for traditional top-k collection, just trading a bit of effectiveness for better efficiency. It's probably especially acceptable with high `k` values, when hits are expected to be fed to a reranker in a later stage.

> Are there any numbers around how this affects recall, or does the Tantivy benchmark account for actually getting the correct document?

No, there are no measures of the effect on recall or other measures of effectiveness.

> Very cool, surprisingly impactful!

For reference, I noticed that it especially helps in the following cases:

- Stop words mixed with high-scoring clauses: quantization helps the score contribution of stop words become a rounding error rather than an actual score contribution, which in turn helps the pruning logic consider almost only the high-scoring clauses. A good example from the benchmark is `the progressive`. The query `the incredibles` is an extreme case: `incredibles` has an extremely low term frequency, so in practice the query looks like a query on `the` where a few hits get boosted because they contain `incredibles`.
- Clauses with similar maximum scores: these queries are complicated because it's rare that the same document yields the maximum score in a block for both clauses, so the actual scores produced by such queries are often well below the sum of the per-clause maximum scores. They can run like conjunctions in practice, but they can rarely skip blocks across all clauses. Quantization makes this more likely. A good example from the benchmark is `american south`.
- Conjunctions: these suffer from the same problem as above; it's rare that all clauses yield their maximum score on the same document, which makes skipping whole blocks rare in practice. Quantization makes it more likely by creating more ties.

> This is the Tantivy benchmark tooling, but you are comparing Lucene (main) to Lucene (your PR) right?

Actually, since this only introduces a single class, I copied the class into the source folder of the `engine` querying logic. So the baseline is Lucene 9.8.0 and the contender is Lucene 9.8.0 with the top-score collector wrapped in the collector wrapper added in this PR, configured with `accuracyBits=5`.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org