richardstartin commented on a change in pull request #8411: URL: https://github.com/apache/pinot/pull/8411#discussion_r837850389
########## File path: pinot-core/src/main/java/org/apache/pinot/core/operator/filter/BitmapBasedFilterOperator.java ########## @@ -100,6 +101,67 @@ protected FilterBlock getNextBlock() { } } + @Override + public boolean canOptimizeCount() { + return true; + } + + @Override + public int getNumMatchingDocs() { + int count = 0; + if (_invertedIndexReader == null) { + count = _docIds.getCardinality(); + } else { + int[] dictIds = _exclusive + ? _predicateEvaluator.getNonMatchingDictIds() + : _predicateEvaluator.getMatchingDictIds(); + switch (dictIds.length) { + case 0: + break; + case 1: { + count = _invertedIndexReader.getDocIds(dictIds[0]).getCardinality(); + break; + } + case 2: { + count = ImmutableRoaringBitmap.orCardinality(_invertedIndexReader.getDocIds(dictIds[0]), + _invertedIndexReader.getDocIds(dictIds[1])); + break; + } + default: { + // this could be optimised if the bitmaps are known to be disjoint (as in a single value bitmap index) + MutableRoaringBitmap bitmap = new MutableRoaringBitmap(); + for (int dictId : dictIds) { + bitmap.or(_invertedIndexReader.getDocIds(dictId)); + } + count = bitmap.getCardinality(); + break; + } + } + } + return _exclusive ? _numDocs - count : count; + } + + @Override + public boolean canProduceBitmaps() { + return true; + } + + @Override + public BitmapCollection getBitmaps() { + if (_docIds == null) { + int[] dictIds = _exclusive + ? _predicateEvaluator.getNonMatchingDictIds() + : _predicateEvaluator.getMatchingDictIds(); + ImmutableRoaringBitmap[] bitmaps = new ImmutableRoaringBitmap[dictIds.length]; + for (int i = 0; i < dictIds.length; i++) { + bitmaps[i] = (_invertedIndexReader.getDocIds(dictIds[i])); Review comment: Perhaps I should clarify the reasons not to merge bitmaps eagerly 1. `or` is expensive, its size increases monotonically 2. flipping bits is expensive, it maps sparse to dense by filling in holes 3. (not implemented yet) it's always better to distribute intersections across unions because it increases sparsity before doing the union 4. it may never be necessary to do the `or` at all, it can be implemented with `andNot` from a universe bitmap, this approach is generally 2x faster than eagerly computing an `or` 5. these bitmaps are all just handles around mapped memory and may never need to be copied -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org