siddharthteotia opened a new pull request, #18543:
URL: https://github.com/apache/pinot/pull/18543
### Improvement Summary
- Optimizes `InvertedIndexFilterOperator.getNumMatchingDocs()` for
single-value columns by replacing the materialized-union approach with a sum of
per-dictId bitmap cardinalitie
- Eliminates MutableRoaringBitmap allocation on the hot count path and
yields speedups
- Addresses an old TODO in the code referring to this optimization.
- Multi-value path is unchanged in this PR — an approach was prototyped and
benchmarked but regressed at the edges. Will do a follow-up PR for
thisthreshold-aware MV optimization will follow in a separate PR.
### Functional Testing
- Added new unit test focused on specific code paths for 100% coverage
- All existing SV and MV query tests that exercise this code path are
passing. No new e2e query tests should be needed
### Performance Testing
See BenchmarkInvertedIndexGetNumMatchingDocs
- JMH, numDocs=5M, Used -prof GC option with JMH
- JDK 21, macOS Apple Silicon (re ran on x86 as well to cross check)
- Dictionary cardinality ∈ {1024, 100K, 1M}
- NumMatcingDictIDs (K) ∈ {4, 16, 64, 256, 1K, 10K, 100K}
**SV Results**
#### dict cardinality = 1K
| K | current | new | speedup | current alloc| new alloc |
|------:|---------:|----------:|--------:|--------------:|---------------:|
| 4 | 44.71 µs | 90 ns | 499x | 137.4 KB | 0 B |
| 16 | 738 µs | 530 ns | 1392x | 565.5 KB | 0 B |
| 64 | 8.39 ms | 4.12 µs | 2035x | 3.36 MB | 0 B |
| 256 | 9.65 ms | 16.57 µs | 582x | 3.41 MB | 0 B |
| 1,000 | 14.10 ms | 66.77 µs | 211x | 3.41 MB | 0 B |
dict cardinality = 100K
| K | current | new | speedup | current alloc | new alloc |
|--------:|----------:|----------:|--------:|--------------:|---------------:|
| 4 | 3.51 µs | 63 ns | 56x | 13.1 KB | 0
B |
| 16 | 19.53 µs | 339 ns | 58x | 33.1 KB | 0
B |
| 64 | 103.73 µs | 3.05 µs | 34x | 63.4 KB | 0
B |
| 256 | 1.20 ms | 14.39 µs | 84x | 183.0 KB | 0
B |
| 1,000 | 11.77 ms | 64.41 µs | 183x | 547.9 KB | 0
B |
| 10,000 | 223.15 ms | 3.50 ms | 64x | 3.82 MB | 0
B |
| 100,000 | 428.49 ms | 36.62 ms | 12x | 3.85 MB | 0 B
|
dict cardinality = 1M
| K | current | new | speedup | current alloc | new alloc |
|--------:|----------:|----------:|--------:|--------------:|---------------:|
| 4 | 1.39 µs | 25 ns | 55x | 5.3 KB | 0
B |
| 16 | 6.17 µs | 120 ns | 51x | 12.8 KB | 0
B |
| 64 | 23.08 µs | 618 ns | 37x | 30.7 KB | 0
B |
| 256 | 132.59 µs | 3.81 µs | 35x | 53.8 KB | 0
B |
| 1,000 | 996.64 µs | 24.31 µs | 41x | 137.8 KB | 0
B |
| 10,000 | 44.85 ms | 260.38 µs | 172x | 1016.4 KB | 0
B |
| 100,000 | 342.36 ms | 9.36 ms | 37x | 4.00 MB | 32
B |
Note
- Allocation is allocation per query (per segment per server).
- 3.36MB for a query touching 100 segments at 1000QPS is 336GB/sec of young
gen pressure.
**MV Results (MV Implementation not changed in this PR but bench test has MV
POC code)*
- Tried with . BufferFastAggregation.orCardinality
- This also allocates internally a scratch bitmap
- Does a two-pass scan over all input bitmaps (pass 1 collects union
container keys; pass 2 ORs the matching container from each input into a reused
scratch container, per key)
- At low K with tiny per-bitmap cardinality (sparse columns) the fixed
scratch cost is unamortized. At very high K with high cardinality, the per-key
bitmap walk dominates.
- So the results were mixed (upto 7x faster but also 2x slower in some
cases).
### Backwards Compatibility
- Compatible
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]