xiangfu0 commented on PR #14698:
URL: https://github.com/apache/pinot/pull/14698#issuecomment-4151064054
## Performance Benchmark Results
Ran `BenchmarkGroupByCombineOperator` (JMH, 1 fork, 3 iterations × 5s, 32
segments, 10K records/segment, 8 threads) across four group-by cardinalities.
### Baseline vs Final (PARTITIONED algorithm)
Two optimization commits were added on top of the original PR:
**Commit 1 — `Fix LIMIT violation and lost trim signal`** (correctness fix,
no perf change)
- `finish()` now caps results to `_resultSize` for GROUP BY without ORDER BY
- `mergePartitionTable()` and `publishLocalPartitions()` now propagate
`_numResizes`/`_resizeTimeNs` from absorbed tables
**Commit 2 — `Cache Key.hashCode() + skip partition merge`** (~40-48%
improvement)
- Cache `Key.hashCode()`: compute `Arrays.hashCode()` once at construction,
O(1) thereafter. Eliminates redundant rehashing on every HashMap lookup,
partition routing, and merge.
- Skip `mergePartitionTables()`: finish each partition independently and
concatenate results via a zero-copy `CompositePartitionTable` iterator,
eliminating the O(N) `putAll` cost of merging all partition entries into a
single HashMap.
### Numbers: PARTITIONED Before → After Optimization
| Unique Groups | Before (ms/op) | After (ms/op) | Improvement |
|---:|---:|---:|---:|
| 500 | 6.6 | **3.4** | **1.9x (48%)** |
| 5,000 | 12.7 | **7.7** | **1.7x (39%)** |
| 50,000 | 45.7 | **28.9** | **1.6x (37%)** |
| 500,000 | 129.1 | **74.5** | **1.7x (42%)** |
### Numbers: All Three Algorithms Compared (After Optimization)
| Unique Groups | DEFAULT (ms) | NON-BLOCKING (ms) | PARTITIONED (ms) |
PARTITIONED vs DEFAULT |
|---:|---:|---:|---:|---:|
| 500 | 127.6 | 5.0 | **3.4** | **37.5x** |
| 5,000 | 132.4 | 10.0 | **7.7** | **17.2x** |
| 50,000 | 144.0 | 63.7 | **28.9** | **5.0x** |
| 500,000 | 184.8 | 136.7 | **74.5** | **2.5x** |
### Key Observations
1. **Low cardinality (500 groups)**: PARTITIONED is 37.5x faster than
DEFAULT — extreme contention on `ConcurrentHashMap` with 8 threads fighting
over 500 keys is completely eliminated by partitioning.
2. **High cardinality (500K groups)**: PARTITIONED is still 2.5x faster than
DEFAULT. The `CompositePartitionTable` optimization avoids the serial merge
bottleneck.
3. **PARTITIONED vs NON-BLOCKING**: PARTITIONED beats NON-BLOCKING at all
cardinalities above ~500 groups and is roughly tied at low cardinality.
### Benchmark Configuration
```
@Param({"500", "5000", "50000", "500000"}) _uniqueGroups
@Param({"10000"}) _numRecordsPerSegment
@Param({"8"}) _numThreads
32 segments, GROUP BY d1, d2 ORDER BY SUM(m1) DESC LIMIT 500
JVM: -server -Xms4G -Xmx4G
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]