[GitHub] [lucene] jpountz commented on pull request #12415: Optimize disjunction counts.

via GitHub Thu, 27 Jul 2023 10:06:12 -0700


jpountz commented on PR #12415:
URL: https://github.com/apache/lucene/pull/12415#issuecomment-1654029579


   > I'd love to start benchmarking charts for these before we land this opto 
so we can fully appreciate / document the "pop" 
   
   +1, I'll wait for a few data points before merging.
   
   > I wonder what other queries could (later) benefit from DocIdStream bulk 
collection ...
   
   I tried to think about this too.
   
   `MatchAllDocsQuery` is an obvious candidate, but it's already optimized 
differently using `Weight#count`. It's probably still a good idea to implement 
this API on `MatchAllDocsQuery` so that it would help pure negations (a 
`MatchAllDocsQuery` in a MUST/FILTER/SHOULD clause, and one or more MUST_NOT 
clauses), as this will trigger usage of `ReqExclBulkScorer` which will delegate 
to the `MatchAllDocsQuery` `BulkScorer`.
   
   Queries that produce bitsets could also implement a similar optimization, 
e.g. (numeric) `range`, `prefix`, `wildcard` or `geo` queries. I expect the 
cost of building the bitset to dominate the overall execution time, but it will 
probably still yield a noticeable speedup. One open question there is 
whether we should pass the entire BitSet as a single `DocIdStream` or whether 
there are reasons to split it anyway.
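   
   To illustrate the single-stream option, here is a minimal sketch in plain Java. `SimpleDocIdStream` and `BitSetDocIdStream` are hypothetical names, not Lucene classes, and `java.util.BitSet` stands in for whatever bitset the query builds. The point is only that `count()` can be answered by a popcount over the bitset instead of visiting each doc ID.

```java
import java.util.BitSet;
import java.util.function.IntConsumer;

// Hypothetical stand-in for a DocIdStream, reduced to the two operations discussed here.
interface SimpleDocIdStream {
  void forEach(IntConsumer consumer);

  int count();
}

// Wraps the whole bitset of matching doc IDs as one stream (assumption: no splitting).
final class BitSetDocIdStream implements SimpleDocIdStream {
  private final BitSet matches;

  BitSetDocIdStream(BitSet matches) {
    this.matches = matches;
  }

  @Override
  public void forEach(IntConsumer consumer) {
    // Visit every set bit in increasing doc ID order.
    for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
      consumer.accept(doc);
    }
  }

  @Override
  public int count() {
    // Counts set bits word by word, without iterating over individual doc IDs.
    return matches.cardinality();
  }
}
```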
   
   Term queries could theoretically return a DocIdStream per block of 128 doc 
IDs, where decoding would happen lazily at the beginning of 
`DocIdStream#forEach` and `DocIdStream#count` would return 128 without even 
decoding postings. This would require more intimate integration with the codec 
as we don't have the right APIs to do this at the moment.
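   
   A rough sketch of the lazy-decoding idea, again in plain Java. Everything here is an assumption made for illustration: `LazyBlockDocIdStream` is a hypothetical class, and a plain array of deltas stands in for the codec's encoded postings block. It only shows that `count()` can return the block size without touching the encoded data, while `forEach` pays the decoding cost.

```java
import java.util.function.IntConsumer;

// Hypothetical: one full postings block exposed as a stream. Not an existing Lucene API.
final class LazyBlockDocIdStream {
  static final int BLOCK_SIZE = 128;

  private final int baseDoc;  // doc ID that precedes the first doc of the block
  private final int[] deltas; // stand-in for the encoded block: gaps between doc IDs
  int decodeCalls = 0;        // instrumentation for this sketch only

  LazyBlockDocIdStream(int baseDoc, int[] deltas) {
    if (deltas.length != BLOCK_SIZE) {
      throw new IllegalArgumentException("expected a full block of " + BLOCK_SIZE);
    }
    this.baseDoc = baseDoc;
    this.deltas = deltas;
  }

  // A full block always holds BLOCK_SIZE docs, so no decoding is needed to count.
  int count() {
    return BLOCK_SIZE;
  }

  // Decoding happens only here, when a caller actually needs the doc IDs.
  void forEach(IntConsumer consumer) {
    decodeCalls++;
    int doc = baseDoc;
    for (int delta : deltas) {
      doc += delta;
      consumer.accept(doc);
    }
  }
}
```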
   
   And like you already suggested, we could handle some conjunctions if we ran 
them through BS1.
   
   In general, deletions will tend to disable this optimization (BS1 is a 
notable case where deletions would not disable it). It might help 
to have a `nextClearBit` on `Bits` to be able to still apply this optimization. 
E.g. `MatchAllDocsQuery` could use `Bits#nextClearBit` on live docs to create a 
`DocIdStream` for every sequence of adjacent non-deleted doc IDs to speed up 
counting under sparse deletions.
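   
   The run-splitting idea can be sketched as follows. This is an illustration under stated assumptions: `java.util.BitSet` stands in for live docs because it already offers `nextSetBit` and `nextClearBit`, and `LiveDocRuns` is a hypothetical helper. Each `[start, end)` range would back one stream whose count is simply `end - start`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper: splits live docs into runs of adjacent non-deleted doc IDs.
final class LiveDocRuns {
  private LiveDocRuns() {}

  // Returns [start, end) ranges; a set bit means the doc is live (not deleted).
  static List<int[]> liveRuns(BitSet liveDocs, int maxDoc) {
    List<int[]> runs = new ArrayList<>();
    int start = liveDocs.nextSetBit(0);
    while (start >= 0 && start < maxDoc) {
      // The run ends at the next deleted doc, or at maxDoc if none follows.
      int end = Math.min(liveDocs.nextClearBit(start), maxDoc);
      runs.add(new int[] {start, end});
      start = liveDocs.nextSetBit(end);
    }
    return runs;
  }

  // Counting needs one subtraction per run instead of one check per doc.
  static int countLive(BitSet liveDocs, int maxDoc) {
    int count = 0;
    for (int[] run : liveRuns(liveDocs, maxDoc)) {
      count += run[1] - run[0];
    }
    return count;
  }
}
```

   With sparse deletions there are few runs, so the cost of counting scales with the number of deleted docs rather than with `maxDoc`.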


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

