[PR] Avoid materializing the full bitmap for count aggregation with filter on column with inverted index [pinot]

via GitHub Wed, 20 May 2026 08:45:41 -0700


siddharthteotia opened a new pull request, #18543:
URL: https://github.com/apache/pinot/pull/18543


   ### Improvement Summary 
   - Optimizes `InvertedIndexFilterOperator.getNumMatchingDocs()` for 
single-value columns by replacing the materialized-union approach with a sum of 
per-dictId bitmap cardinalitie
   - Eliminates MutableRoaringBitmap allocation on the hot count path and 
yields speedups
   - Addresses an old TODO in the code referring to this optimization. 
   - Multi-value path is unchanged in this PR — an approach was prototyped and 
benchmarked but regressed at the edges. Will do a follow-up PR for 
thisthreshold-aware MV optimization will follow in a separate PR.
   
   ### Functional Testing
   - Added new unit test focused on specific code paths for 100% coverage
   - All existing SV and MV query tests that exercise this code path are 
passing. No new e2e query tests should be needed
   
   ### Performance Testing
   
   See BenchmarkInvertedIndexGetNumMatchingDocs 
   - JMH, numDocs=5M, Used -prof GC option with JMH
   - JDK 21, macOS Apple Silicon (re ran on x86 as well to cross check)
   - Dictionary cardinality ∈ {1024, 100K, 1M}
   - NumMatcingDictIDs (K) ∈ {4, 16, 64, 256, 1K, 10K, 100K}
   
   **SV Results**
   
   #### dict cardinality = 1K
   
     | K     |  current |  new | speedup | current alloc| new alloc |
     |------:|---------:|----------:|--------:|--------------:|---------------:|
     |     4 | 44.71 µs |     90 ns |   499x  |      137.4 KB |            0 B |
     |    16 |   738 µs |    530 ns |  1392x  |      565.5 KB |            0 B |
     |    64 |  8.39 ms |   4.12 µs |  2035x  |       3.36 MB |            0 B |
     |   256 |  9.65 ms |  16.57 µs |   582x  |       3.41 MB |            0 B |
     | 1,000 | 14.10 ms |  66.77 µs |   211x  |       3.41 MB |            0 B |
   
   dict cardinality = 100K
   
     | K       |   current |  new | speedup | current alloc | new alloc |
     
|--------:|----------:|----------:|--------:|--------------:|---------------:|
     |       4 |   3.51 µs |     63 ns |     56x |       13.1 KB |            0 
B |
     |      16 |  19.53 µs |    339 ns |     58x |       33.1 KB |            0 
B |
     |      64 | 103.73 µs |   3.05 µs |     34x |       63.4 KB |            0 
B |
     |     256 |   1.20 ms |  14.39 µs |     84x |      183.0 KB |            0 
B |
     |   1,000 |  11.77 ms |  64.41 µs |    183x |      547.9 KB |            0 
B |
     |  10,000 | 223.15 ms |   3.50 ms |     64x |       3.82 MB |           0 
B |
     | 100,000 | 428.49 ms |  36.62 ms |     12x |       3.85 MB |          0 B 
|
   
   dict cardinality = 1M
   
     | K       |   current |  new | speedup | current alloc  | new alloc |
     
|--------:|----------:|----------:|--------:|--------------:|---------------:|
     |       4 |   1.39 µs |     25 ns |     55x |        5.3 KB |            0 
B |
     |      16 |   6.17 µs |    120 ns |     51x |       12.8 KB |            0 
B |
     |      64 |  23.08 µs |    618 ns |     37x |       30.7 KB |            0 
B |
     |     256 | 132.59 µs |   3.81 µs |     35x |       53.8 KB |            0 
B |
     |   1,000 | 996.64 µs |  24.31 µs |     41x |      137.8 KB |            0 
B |
     |  10,000 |  44.85 ms | 260.38 µs |    172x |     1016.4 KB |            0 
B |
     | 100,000 | 342.36 ms |   9.36 ms |     37x |       4.00 MB |           32 
B |
   
   Note
   - Allocation is allocation per query (per segment per server). 
   - 3.36MB for a query touching 100 segments at 1000QPS is 336GB/sec of young 
gen pressure. 
   
   **MV Results (MV Implementation not changed in this PR but bench test has MV 
POC code)*
   
   - Tried with . BufferFastAggregation.orCardinality
   - This also allocates internally a scratch bitmap
   - Does a two-pass scan over all input bitmaps (pass 1 collects union 
container keys; pass 2 ORs the matching container from each input into a reused 
scratch container, per key)
   - At low K with tiny per-bitmap cardinality (sparse columns) the fixed 
scratch cost is unamortized. At very high K with high cardinality,  the per-key 
bitmap walk dominates. 
   - So the results were mixed (upto 7x faster but also 2x slower in some 
cases). 
   
   ### Backwards Compatibility
   
   - Compatible


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Avoid materializing the full bitmap for count aggregation with filter on column with inverted index [pinot]

Reply via email to