jpountz commented on PR #12628:
URL: https://github.com/apache/lucene/pull/12628#issuecomment-1752152301

   I'll try to give a bit more context on how I ended up here. With the recent work on vector search and the excitement around it, I can't help thinking that all users who are happy to trade some effectiveness for better efficiency in their vector searches should be happy to make the same trade-off in their term-based searches as well.
   
   There is some literature on rank-unsafe evaluation of top hits. The main work I'm familiar with is the original [WAND paper](https://www.researchgate.net/profile/David-Carmel-3/publication/221613425_Efficient_query_evaluation_using_a_two-level_retrieval_process/links/02bfe50e6854500153000000/Efficient-query-evaluation-using-a-two-level-retrieval-process.pdf), where the authors suggest configuring a threshold factor F and running the WAND operator so that it only considers matches whose sum of maximum scores across query terms is at least F times the k-th score in the priority queue. When F is 1 or less, the returned top hits are the same as those produced by exhaustive evaluation. Setting F greater than 1 trades effectiveness for efficiency. The authors share experiments showing how F=2 affects P@10 and MAP, as well as the number of evaluated hits.
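   To make the pruning condition concrete, here is a minimal sketch of the thresholded check (names are mine, not from the paper or from Lucene):

```java
/** Illustrative sketch of the thresholded WAND check; all names are made up. */
final class ThresholdedWand {
  /**
   * With f <= 1 this is the usual, rank-safe WAND condition; f > 1 prunes
   * more aggressively and may miss some of the true top-k hits.
   */
  static boolean competitive(float sumOfMaxScores, float kthScoreInQueue, float f) {
    return sumOfMaxScores >= f * kthScoreInQueue;
  }
}
```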
   
   I'd like to expose this kind of trade-off in Lucene. One problem I have with the above approach is that it is very WAND-specific. A similar idea that wouldn't be WAND-specific would be to configure the minimum competitive score (`Scorer#setMinimumCompetitiveScore`) as the k-th score in the heap times some factor. This feels a bit too low-level to be exposed as an option on `IndexSearcher`, so I ended up trying this idea of quantizing scores, which helps for very similar reasons and which I find easier to explain: it computes exact top hits on quantized scores. To buy some time to think about how to expose it on `IndexSearcher`, I'm starting with this collector wrapper, which doesn't require changes to lucene-core and would put it in the hands of users who are interested in trying it out.
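   For intuition, rounding a score down to a few significant bits could look like the following sketch (an illustration of the idea, not necessarily the exact code in this PR):

```java
// Sketch: round a score down by keeping only `accuracyBits` significant
// bits of its float representation. Assumes scores are non-negative,
// which Lucene requires anyway.
static float quantizeDown(float score, int accuracyBits) {
  int bits = Float.floatToIntBits(score);
  // Keep the exponent plus the top (accuracyBits - 1) explicit mantissa
  // bits; clearing the low-order bits rounds the score down.
  int mask = -1 << (24 - accuracyBits);
  return Float.intBitsToFloat(bits & mask);
}
```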
   
   For reference, https://github.com/apache/lucene/pull/12446 is another PR where I'm exploring how to expose rank-unsafe optimizations in Lucene.
   
   > Would the typical usage of this scorer be as an initial collection of k 
docs, which would then get re-scored over their full floating point scores?
   
   If someone wanted to do that, they could do it in a single pass by using a collector wrapper similar to the one I'm adding, only changing the scores passed to `Scorer#setMinimumCompetitiveScore`, not the ones returned by `Scorer#score` (see the sketch below). I think the idea is rather for it to be used as a drop-in replacement for traditional top-k collection, just trading a bit of effectiveness for better efficiency. It's probably especially acceptable with high `k` values, when hits are expected to be fed to a reranker in a later stage.
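   Concretely, such a single-pass variant could look like the following sketch, built on `FilterCollector` and `FilterScorable`. This is hypothetical, untested code, not part of this PR: it collects exact scores, but rounds minimum competitive scores up to the next quantized value before forwarding them to the scorer, so pruning behaves as if scores were quantized.

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FilterCollector;
import org.apache.lucene.search.FilterLeafCollector;
import org.apache.lucene.search.FilterScorable;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;

/** Hypothetical single-pass variant: exact scores, rank-unsafe pruning. */
final class QuantizedPruningCollector extends FilterCollector {

  private final int accuracyBits;

  QuantizedPruningCollector(Collector in, int accuracyBits) {
    super(in);
    this.accuracyBits = accuracyBits;
  }

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    return new FilterLeafCollector(super.getLeafCollector(context)) {
      @Override
      public void setScorer(Scorable scorer) throws IOException {
        in.setScorer(
            new FilterScorable(scorer) {
              @Override
              public void setMinimumCompetitiveScore(float minScore) throws IOException {
                // Rounding the threshold up to the next quantized value is
                // what makes collection rank-unsafe; the scores themselves
                // are left untouched.
                in.setMinimumCompetitiveScore(roundUpToQuantized(minScore, accuracyBits));
              }
            });
      }
    };
  }

  private static float roundUpToQuantized(float score, int accuracyBits) {
    int shift = 24 - accuracyBits;
    int mask = -1 << shift;
    int bits = Float.floatToIntBits(score);
    if ((bits & ~mask) == 0) {
      return score; // already on a quantization boundary
    }
    // For positive floats, incrementing the masked bit pattern yields the
    // next quantized value; the carry into the exponent does the right
    // thing at power-of-two boundaries.
    return Float.intBitsToFloat((bits & mask) + (1 << shift));
  }
}
```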
   
   > Are there any numbers around how this affects recall or does the Tantivy benchmark account for actually getting the correct document?
   
   No, there are no measurements of the effect on recall or other effectiveness metrics.
   
   > Very cool, surprisingly impactful!
   
   For reference, I noticed that it especially helps in the following cases:
    - Stop words mixed with high-scoring clauses: quantization turns the score contribution of stop words into a rounding error rather than an actual score contribution, which in turn lets the pruning logic consider almost only the high-scoring clauses. A good example from the benchmark is `the progressive`. The query `the incredibles` is an extreme case: `incredibles` is an extremely rare term, so in practice the query behaves like a query on `the` where a few hits get boosted because they contain `incredibles`.
    - Clauses with similar maximum scores. These queries are complicated because it's rare that the same document yields the maximum score in a block for both clauses, so the actual scores produced by such queries are often well below the sum of the per-clause maximum scores. They can run like conjunctions in practice, but they can rarely skip blocks across all clauses. Quantization makes this more likely. A good example from the benchmark is `american south`.
    - Conjunctions. Conjunctions suffer from the same problem as above: it's rare that all clauses yield their maximum score on the same document, which makes skipping whole blocks rare in practice. Quantization makes it more likely by creating more ties, as the toy example below shows.
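   For instance, with 5 bits of accuracy, nearby scores collapse to the same value (made-up scores, using the `quantizeDown` sketch from above):

```java
float[] scores = {12.70f, 12.41f, 12.02f, 11.93f};
for (float s : scores) {
  System.out.println(s + " -> " + quantizeDown(s, 5));
}
// Prints:
// 12.7 -> 12.5
// 12.41 -> 12.0
// 12.02 -> 12.0
// 11.93 -> 11.5
// 12.41 and 12.02 now tie at 12.0. Once the k-th best quantized score
// reaches 12.0, any block whose maximum score rounds down to 12.0 or less
// can be skipped entirely.
```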
   
   > This is the Tantivy benchmark tooling, but you are comparing Lucene (main) 
to Lucene (your PR) right?
   
   Actually, since this PR only introduces a single class, I copied that class into the source folder of the `engine` querying logic. So the baseline is Lucene 9.8.0, and the contender is Lucene 9.8.0 with the top-score collector wrapped in the collector wrapper added in this PR, configured with `accuracyBits=5`.
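   Wiring it up looks roughly like this (`QuantizedScoreCollector` is a stand-in name for the wrapper added in this PR; the real class name and constructor may differ):

```java
// Hypothetical wiring; QuantizedScoreCollector stands in for the wrapper
// added in this PR.
TopScoreDocCollector topDocs = TopScoreDocCollector.create(10, 1_000);
searcher.search(query, new QuantizedScoreCollector(topDocs, 5 /* accuracyBits */));
ScoreDoc[] hits = topDocs.topDocs().scoreDocs;
```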

