jpountz commented on PR #12415: URL: https://github.com/apache/lucene/pull/12415#issuecomment-1654029579
> I'd love to start benchmarking charts for these before we land this opto so we can fully appreciate / document the "pop"

+1 I'll wait for a few data points before merging.

> I wonder what other queries could (later) benefit from DocIdStream bulk collection ...

I tried to think about this too.

`MatchAllDocsQuery` is an obvious candidate, but it's already optimized differently using `Weight#count`. It's probably still a good idea to implement this API on `MatchAllDocsQuery` so that it would help pure negations (a `MatchAllDocsQuery` in a MUST/FILTER/SHOULD clause, and one or more MUST_NOT clauses), as this will trigger usage of `ReqExclBulkScorer`, which will delegate to the `MatchAllDocsQuery` `BulkScorer`.

Queries that produce bitsets could also implement a similar optimization, e.g. (numeric) `range`, `prefix`, `wildcard` or `geo` queries. I expect the cost of building the bitset to dominate the overall execution time, but it will probably still yield a noticeable speedup. One open question there is whether we should pass the entire BitSet as a single `DocIdStream` or whether there are reasons to split it anyway.

Term queries could theoretically return a `DocIdStream` per block of 128 doc IDs, where decoding would happen lazily at the beginning of `DocIdStream#forEach` and `DocIdStream#count` would return 128 without even decoding postings. This would require more intimate integration with the codec, as we don't have the right APIs to do this at the moment.

And like you already suggested, we could handle some conjunctions if we ran them through BS1.

In general, deletions will tend to disable this optimization. (BS1 is a notable case where deletions would not disable it.) It might help to have a `nextClearBit` on `Bits` to be able to still apply this optimization. E.g.
`MatchAllDocsQuery` could use `Bits#nextClearBit` on live docs to create a `DocIdStream` for every sequence of adjacent non-deleted doc IDs to speed up counting under sparse deletions.
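
To make the `nextClearBit` idea concrete, here is a rough sketch of what I have in mind. It is not Lucene code: `Bits#nextClearBit` does not exist today, so `java.util.BitSet` stands in for live docs (set bit = live doc), and `LiveDocRuns`, `liveRuns` and `countLive` are made-up names for illustration. Each `[from, to)` range plays the role of one `DocIdStream`, whose count is just `to - from` without visiting individual docs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; java.util.BitSet stands in for Lucene live docs.
class LiveDocRuns {

    /**
     * Splits [0, maxDoc) into runs of adjacent live doc IDs.
     * Each int[]{from, to} is a half-open range of live docs.
     */
    static List<int[]> liveRuns(BitSet liveDocs, int maxDoc) {
        List<int[]> runs = new ArrayList<>();
        int start = liveDocs.nextSetBit(0);
        while (start >= 0 && start < maxDoc) {
            // The run ends at the next deleted doc, or at maxDoc.
            int end = Math.min(liveDocs.nextClearBit(start), maxDoc);
            runs.add(new int[] {start, end});
            // Skip over deleted docs to the start of the next run.
            start = liveDocs.nextSetBit(end);
        }
        return runs;
    }

    /** Counts live docs by summing run lengths instead of checking every doc. */
    static int countLive(BitSet liveDocs, int maxDoc) {
        int count = 0;
        for (int[] run : liveRuns(liveDocs, maxDoc)) {
            count += run[1] - run[0];
        }
        return count;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        BitSet liveDocs = new BitSet(maxDoc);
        liveDocs.set(0, maxDoc); // all docs live
        liveDocs.clear(3);       // delete doc 3
        liveDocs.clear(7);       // delete doc 7
        System.out.println(countLive(liveDocs, maxDoc)); // prints 8
    }
}
```

With sparse deletions there are few runs, so the work is proportional to the number of deleted docs rather than to `maxDoc`.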