[I] [Bug] Lead cost in boolean conjunction queries can be miscalculated [lucene]

via GitHub Tue, 22 Apr 2025 17:04:51 -0700


peteralfonsi opened a new issue, #14542:
URL: https://github.com/apache/lucene/issues/14542


   ### Description
   
   In boolean conjunction queries, iteration is led by the DISI with the lowest 
`cost()`. To do this we compute the `leadCost` first from the relevant 
`ScorerSupplier`s, and then get the actual `Scorer`s using `leadCost`. 
   
   In Lucene 10.0 and onwards, there seems to be a bug in 
`BooleanScorerSupplier.requiredBulkScorer()`. I think the intention is to 
compute the minimum `cost()` across the scorer suppliers for all MUST and 
FILTER clauses, which matches my understanding of how conjunctions should 
work, and also matches what would happen in [the pre-10.0 
code](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md). 
   
   However the 
[code](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) now does: 
   
   ```
   long leadCost =
       subs.get(Occur.MUST).stream().mapToLong(ScorerSupplier::cost).min().orElse(Long.MAX_VALUE);
   leadCost =
       subs.get(Occur.FILTER).stream().mapToLong(ScorerSupplier::cost).min().orElse(leadCost);
   ```
   so if there are both MUST and FILTER clauses, `leadCost` always ends up 
equal to the minimum FILTER cost, even when that is greater than the minimum 
MUST cost. The `orElse(leadCost)` fallback only keeps the MUST minimum when 
there are no FILTER clauses at all. 
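
To make the difference concrete, here is a minimal standalone sketch that reproduces the two-step computation on plain lists of costs and compares it with taking the minimum over both clause types. It is not Lucene code: the class `LeadCostSketch`, the method names, and the example costs are all hypothetical, and the "intended" version only reflects my reading of what the conjunction should do.

```java
import java.util.List;

final class LeadCostSketch {
    // Mirrors the quoted snippet: if any FILTER cost exists, its minimum
    // overwrites the MUST minimum instead of being compared with it.
    static long currentLeadCost(List<Long> mustCosts, List<Long> filterCosts) {
        long leadCost =
            mustCosts.stream().mapToLong(Long::longValue).min().orElse(Long.MAX_VALUE);
        leadCost =
            filterCosts.stream().mapToLong(Long::longValue).min().orElse(leadCost);
        return leadCost;
    }

    // Assumed intent (hypothetical): the minimum over MUST and FILTER together.
    static long intendedLeadCost(List<Long> mustCosts, List<Long> filterCosts) {
        long mustMin =
            mustCosts.stream().mapToLong(Long::longValue).min().orElse(Long.MAX_VALUE);
        long filterMin =
            filterCosts.stream().mapToLong(Long::longValue).min().orElse(Long.MAX_VALUE);
        return Math.min(mustMin, filterMin);
    }

    public static void main(String[] args) {
        List<Long> must = List.of(10L);           // a selective MUST clause
        List<Long> filter = List.of(1_000_000L);  // a broad FILTER clause
        System.out.println(currentLeadCost(must, filter));   // prints 1000000
        System.out.println(intendedLeadCost(must, filter));  // prints 10
    }
}
```

With only MUST clauses or only FILTER clauses the two versions agree; they diverge only when both are present and the cheapest clause is a MUST.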
   
   This can cause a significant performance hit for such queries. See this 
OpenSearch issue 
(https://github.com/opensearch-project/OpenSearch/issues/17870) for some more 
details and some benchmark numbers. 
   
   In affected queries, most of the CPU time is actually spent constructing the 
`BulkScorer`, not using it to score documents. This happens because the 
`ScorerSupplier` provided by `IndexOrDocValuesQuery` uses `leadCost` to [choose 
whether](https://github.com/apache/lucene/blob/a8b503fdf250b9dbd183cf953fead380b3d51d34/lucene/core/src/java/org/apache/lucene/search/IndexOrDocValuesQuery.java#L174)
 to get the scorer from `indexScorerSupplier.get(leadCost)` (in my case from 
`PointRangeQuery`), or from `dvScorerSupplier.get(leadCost)`. 
   
   When `leadCost` is miscalculated due to the bug, it's higher than it really 
should have been, and so we erroneously use 
`indexScorerSupplier.get(leadCost)`. In this case, that requires running a 
bunch of BKD tree-related code, which slows the query by 40-300% in my tests. 
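
As a rough illustration of why an inflated `leadCost` flips that choice, the sketch below models the decision as a simple threshold comparison. This is not the Lucene implementation: the class and method names are hypothetical, and the threshold of one eighth of the index query's cost is an assumption for illustration; see the linked line for the real rule.

```java
final class IndexOrDocValuesChoiceSketch {
    enum Choice { INDEX, DOC_VALUES }

    // Hypothetical helper. Assumption: the index structure is preferred once
    // leadCost reaches a threshold derived from the index query's own cost
    // (modelled here as cost / 8); below it, doc values are used instead.
    static Choice choose(long indexCost, long leadCost) {
        long threshold = indexCost >>> 3;
        return threshold <= leadCost ? Choice.INDEX : Choice.DOC_VALUES;
    }

    public static void main(String[] args) {
        long indexCost = 800_000L;  // made-up cost of the range query

        // Correct lead cost from a selective MUST clause: doc values win.
        System.out.println(choose(indexCost, 10L));         // prints DOC_VALUES

        // Inflated lead cost taken from a broad FILTER clause: index wins,
        // which in the reported case means paying for the BKD tree work.
        System.out.println(choose(indexCost, 1_000_000L));  // prints INDEX
    }
}
```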
   
   See this flamegraph for an example:
   
   
![Image](https://github.com/user-attachments/assets/72fd1ada-b239-40db-87c2-95d5b27a3ad8)
   
   
   ### Version and environment details
   
   Tar install of current OpenSearch 3.0, which uses Lucene 10.1, running on an 
AL2 ec2 instance. 


