jpountz commented on PR #12490:
URL: https://github.com/apache/lucene/pull/12490#issuecomment-1666957596

   I remember seeing cases when propagating the minimum score on inner clauses 
and take advantage of their skipping logic helped when I started working on 
dynamic pruning indeed, but there have been many changes since then and I can't 
easily find queries that perform better when propagating anymore.
   
   I suspect that one change that plays a role here was the introduction of 
block-max indexes. Since minimum competitive scores are only allowed to go up, 
we can only propagate a minimum score to an inner clause of a boolean query 
that is equal to the minimum competitive score of the boolean query minus the 
sum of the **global** maximum scores of all other clauses. This gives 
`BooleanQuery` better score bounds to work with than its inner `TermQuery`s.
   
   For instance I tried to reproduce the case that you suggested by taking a 
very high-frequency term on wikimedium10m: `ref` (docFreq=877152, maxScore ~= 
1) and a low-frequency term: `quark` (docFreq=566, maxScore ~=8), and both 
conjunctions and disjunctions still run faster with this change. I only managed 
to make the query slower with this change by putting a boost of 1000 on the 
high-frequency term. In that case, propagating the minimum competitive score 
helps because `BooleanQuery` only knows how to skip one block at a time right 
now while `TermQuery` is able to look at impacts at higher levels in skip lists 
to skip many blocks at once.
   
   Something also worth noting is that queries that would benefit from 
propagating the minimum score would generally be queries where a single clause 
dominates the overall score, and these queries are already handled quite 
efficiently by boolean queries. On the other hand, queries where MAXSCORE/WAND 
struggle to help (e.g. because the k value is high, because there are many 
clauses, or because clauses have suboptimal score upper bounds) get a higher 
overhead due to the wrapping with `ImpactsDISI`. So the current approach tends 
to make slow queries slower in exchange of potentially making fast queries 
faster, which is usually not the best trade-off.
   
   > How can callers specify that they do want skipping to be applied in some 
case?
   
   I guess we could have some heuristics when to tell some inner clause that 
they should skip based on their minimum score. E.g. maybe we could do this when 
we have a clause whose max score accounts for more than X% of the total score 
or something along these lines, which suggests that propagating maximim scores 
to this inner clause may help. But it also makes things more complicated 
API-wase since scorer suppliers don't have access to max scores.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to