jpountz commented on PR #12490: URL: https://github.com/apache/lucene/pull/12490#issuecomment-1666957596
I remember seeing cases when propagating the minimum score on inner clauses and take advantage of their skipping logic helped when I started working on dynamic pruning indeed, but there have been many changes since then and I can't easily find queries that perform better when propagating anymore. I suspect that one change that plays a role here was the introduction of block-max indexes. Since minimum competitive scores are only allowed to go up, we can only propagate a minimum score to an inner clause of a boolean query that is equal to the minimum competitive score of the boolean query minus the sum of the **global** maximum scores of all other clauses. This gives `BooleanQuery` better score bounds to work with than its inner `TermQuery`s. For instance I tried to reproduce the case that you suggested by taking a very high-frequency term on wikimedium10m: `ref` (docFreq=877152, maxScore ~= 1) and a low-frequency term: `quark` (docFreq=566, maxScore ~=8), and both conjunctions and disjunctions still run faster with this change. I only managed to make the query slower with this change by putting a boost of 1000 on the high-frequency term. In that case, propagating the minimum competitive score helps because `BooleanQuery` only knows how to skip one block at a time right now while `TermQuery` is able to look at impacts at higher levels in skip lists to skip many blocks at once. Something also worth noting is that queries that would benefit from propagating the minimum score would generally be queries where a single clause dominates the overall score, and these queries are already handled quite efficiently by boolean queries. On the other hand, queries where MAXSCORE/WAND struggle to help (e.g. because the k value is high, because there are many clauses, or because clauses have suboptimal score upper bounds) get a higher overhead due to the wrapping with `ImpactsDISI`. So the current approach tends to make slow queries slower in exchange of potentially making fast queries faster, which is usually not the best trade-off. > How can callers specify that they do want skipping to be applied in some case? I guess we could have some heuristics when to tell some inner clause that they should skip based on their minimum score. E.g. maybe we could do this when we have a clause whose max score accounts for more than X% of the total score or something along these lines, which suggests that propagating maximim scores to this inner clause may help. But it also makes things more complicated API-wase since scorer suppliers don't have access to max scores. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org