jpountz commented on PR #12628:
URL: https://github.com/apache/lucene/pull/12628#issuecomment-1752152301
I'll try to give a bit more context on how I ended up here. With the recent work on vector search and the excitement around it, I can't help thinking that all users who are happy to trade some effectiveness for better efficiency on their vector searches should be happy to do the same with their term-based searches. There is some literature on rank-unsafe evaluation of top hits; the main one I'm familiar with is the original [WAND paper](https://www.researchgate.net/profile/David-Carmel-3/publication/221613425_Efficient_query_evaluation_using_a_two-level_retrieval_process/links/02bfe50e6854500153000000/Efficient-query-evaluation-using-a-two-level-retrieval-process.pdf), whose authors suggest configuring a threshold factor F and running the WAND operator so that it only considers matches whose sum of maximum scores across query terms is at least F times the k-th best score in the priority queue. When F is equal to 1 or less, the returned top hits are the same as those produced by exhaustive evaluation. Setting F greater than 1 trades effectiveness for efficiency. The authors share experiments that show how F=2 affects P@10 and MAP, as well as the number of evaluated hits.

I'd like to expose this kind of trade-off in Lucene. One problem I have with the above approach is that it is very WAND-specific. A similar idea that wouldn't be WAND-specific would be to configure the minimum competitive score (`Scorer#setMinimumCompetitiveScore`) as the k-th best score in the heap times some factor. This feels a bit too low-level for something to be exposed as an option on `IndexSearcher`, so I ended up trying this idea of quantizing scores, which is again very similar in terms of why it helps, and which I find easier to explain: it computes accurate top hits on quantized scores.
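To make the quantization idea concrete, here is a minimal sketch of one way to quantize a score: keep only the top `accuracyBits` bits of the float mantissa and round the rest away. The class name, the `accuracyBits` parameter name, and the masking strategy are illustrative assumptions, not necessarily what the PR implements:

```java
// Illustrative sketch (not the PR's exact code): quantize a non-negative
// score by zeroing all but the top `accuracyBits` bits of the float
// mantissa. Always rounding down guarantees that a quantized score never
// exceeds the original score, and nearby scores collapse into ties, which
// gives block-skipping logic more opportunities to prune.
public final class ScoreQuantizer {

  private final int mask;

  public ScoreQuantizer(int accuracyBits) {
    if (accuracyBits < 1 || accuracyBits > 23) {
      throw new IllegalArgumentException("accuracyBits must be in [1, 23]");
    }
    // A float has a 23-bit mantissa; zero out the low (23 - accuracyBits) bits.
    this.mask = -1 << (23 - accuracyBits);
  }

  /** Round the given non-negative score down onto the quantized grid. */
  public float quantizeDown(float score) {
    return Float.intBitsToFloat(Float.floatToIntBits(score) & mask);
  }
}
```

With `accuracyBits=5`, scores that fall within the same power of two and differ by less than roughly 3% (2^-5 of the leading binade) collapse to the same quantized value.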
To buy some time to think about how to expose it on `IndexSearcher`, I'm starting with this collector wrapper, which doesn't require changes to lucene-core and would help put it in the hands of users who are interested in trying it out. For reference, https://github.com/apache/lucene/pull/12446 is another PR where I'm exploring exposing rank-unsafe optimizations in Lucene.

> Would the typical usage of this scorer be as an initial collection of k docs, which would then get re-scored over their full floating point scores?

If someone wanted to do that, they could do it in a single pass by using a collector wrapper similar to the one I'm adding, only quantizing the scores passed to `Scorer#setMinimumCompetitiveScore`, not the ones returned by `Scorer#score`. I think the idea is rather for it to be used as a drop-in replacement for traditional top-k collection, just trading a bit of effectiveness for better efficiency. It's probably especially acceptable with high `k` values, when hits are expected to be fed to a reranker in a later stage.

> Are there any numbers around how this affects recall, or does the Tantivy benchmark account for actually getting the correct document?

No, there are no measures of the effect on recall or other measures of effectiveness.

> Very cool, surprisingly impactful!

For reference, I noticed that it especially helps in the following cases:

- Stop words mixed with high-scoring clauses: quantization helps the score contribution of stop words become a rounding error rather than an actual score contribution, which in turn helps the pruning logic consider almost only the high-scoring clauses. A good example from the benchmark is `the progressive`. The query `the incredibles` is an extreme case: `incredibles` has an extremely low term frequency, so in practice the query looks like a query on `the` where a few hits get boosted because they contain `incredibles`.
- Clauses with similar maximum scores: these queries are complicated because it's rare that the same document yields the maximum score in a block for both clauses, so the actual scores produced by such queries are often well below the sum of the per-clause maximum scores. They can run like conjunctions in practice, but they can rarely skip blocks across all clauses. Quantization makes this more likely. A good example from the benchmark is `american south`.
- Conjunctions: these suffer from the same problem as above; it's rare that all clauses yield their maximum score on the same document, which makes skipping whole blocks rare in practice. Quantization makes it more likely by creating more ties.

> This is the Tantivy benchmark tooling, but you are comparing Lucene (main) to Lucene (your PR) right?

Actually, since this only introduces a single class, I copied the class into the source folder of the `engine` querying logic. So the baseline is Lucene 9.8.0 and the contender is Lucene 9.8.0 with the top-score collector wrapped in the collector wrapper added in this PR, configured with `accuracyBits=5`.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org