xiangfu0 opened a new pull request, #17996:
URL: https://github.com/apache/pinot/pull/17996
## Summary
This PR implements **Phase 0** of the multi-stage group-by performance
roadmap. It makes no changes to query semantics; everything here is purely
additive (benchmarks, stats, tests).
### What's included
**JMH benchmarks**
(`pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkMSEGroupByKeyGen.java`)
Covers all key-generator code paths:
| Scenario | Generator used |
|---|---|
| `SINGLE_INT` | `OneIntKeyGroupIdGenerator` (Int2IntOpenHashMap) |
| `SINGLE_STRING` | `OneObjectKeyGroupIdGenerator` (Object2IntOpenHashMap) |
| `TWO_INT_KEYS` | `TwoKeysGroupIdGenerator` (Long2IntOpenHashMap, packed longs) |
| `FOUR_INT_KEYS` | `MultiKeysGroupIdGenerator` (`Object2IntOpenHashMap<FixedIntArray>`) |
| `SKEWED` | `OneIntKeyGroupIdGenerator` with Zipfian key distribution |
| `SORTED_COLUMN` | `OneIntKeyGroupIdGenerator` simulating sorted-column shortcut |
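To illustrate the "packed longs" idea in the `TWO_INT_KEYS` row, here is a minimal sketch of folding two `int` keys into one `long` map key. It is not the PR's code: the class name `TwoIntKeyPackingSketch`, its methods, and the bit layout (first key in the high 32 bits) are assumptions, and a plain `HashMap` stands in for fastutil's `Long2IntOpenHashMap`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; the real TwoKeysGroupIdGenerator may use a different layout.
public class TwoIntKeyPackingSketch {
  /** Packs two 32-bit keys into one 64-bit value: first key high, second key low. */
  public static long pack(int first, int second) {
    // Mask the second key so a negative value does not sign-extend into the high half.
    return ((long) first << 32) | (second & 0xFFFFFFFFL);
  }

  /** Recovers the first key from the high 32 bits. */
  public static int unpackFirst(long packed) {
    return (int) (packed >>> 32);
  }

  /** Recovers the second key from the low 32 bits. */
  public static int unpackSecond(long packed) {
    return (int) packed;
  }

  public static void main(String[] args) {
    Map<Long, Integer> groupIds = new HashMap<>();
    int[][] rows = {{1, 2}, {1, -2}, {1, 2}, {-7, 0}};
    for (int[] row : rows) {
      long key = pack(row[0], row[1]);
      // Assign the next dense group id the first time a key pair is seen.
      groupIds.putIfAbsent(key, groupIds.size());
    }
    System.out.println(groupIds.size() + " distinct groups");
  }
}
```

Packing avoids allocating a composite key object per row, which is presumably why the two-key case gets its own generator.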
Two benchmark modes per scenario:
- `keyGenInsert`: builds the map from scratch (reset on each invocation) to measure insert throughput
- `keyGenSteadyState`: performs lookups only against a warm map to measure lookup throughput at saturation
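The difference between the two modes can be sketched without JMH. The snippet below is a simplified stand-in, not the benchmark itself: `GroupIdModesSketch` and its methods are hypothetical names, and a `HashMap` replaces the fastutil maps the real generators use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of what the two benchmark modes exercise.
public class GroupIdModesSketch {
  /** Returns the dense group id for a key, assigning the next free id on first sight. */
  public static int getGroupId(Map<Integer, Integer> map, int key) {
    Integer existing = map.get(key);
    if (existing != null) {
      return existing;
    }
    int id = map.size();
    map.put(key, id);
    return id;
  }

  /** Insert mode: start from an empty map, so every distinct key pays the insert cost. */
  public static int insertPass(int[] keys) {
    Map<Integer, Integer> map = new HashMap<>();
    for (int key : keys) {
      getGroupId(map, key);
    }
    return map.size();
  }

  /** Steady-state mode: the map already holds every key, so each call is a pure lookup. */
  public static long steadyStatePass(Map<Integer, Integer> warmMap, int[] keys) {
    long checksum = 0;
    for (int key : keys) {
      checksum += getGroupId(warmMap, key);
    }
    // Returning a value keeps the loop from being optimized away.
    return checksum;
  }

  public static void main(String[] args) {
    int[] keys = {5, 7, 5, 9};
    System.out.println("groups after insert pass: " + insertPass(keys));
    Map<Integer, Integer> warm = new HashMap<>();
    for (int key : keys) {
      getGroupId(warm, key);
    }
    System.out.println("steady-state checksum: " + steadyStatePass(warm, keys));
  }
}
```

In a JMH harness, the insert variant would typically rebuild its map in a per-invocation setup method, while the steady-state variant would populate it once per trial.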
Two result-materialization benchmarks:
- `materializeTopK` — ORDER BY sum DESC LIMIT 100 via min-heap PriorityQueue
- `materializeAll` — plain full-table truncation via hash-map iterator
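For readers unfamiliar with the min-heap approach named in `materializeTopK`, a minimal sketch follows. It assumes the aggregated sums are already available as a `Map<String, Long>`. The class `TopKMinHeapSketch` and its method are hypothetical and are not taken from the benchmark source.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration of ORDER BY sum DESC LIMIT k using a bounded min-heap.
public class TopKMinHeapSketch {
  public static List<Map.Entry<String, Long>> topK(Map<String, Long> sums, int k) {
    if (k <= 0) {
      return new ArrayList<>();
    }
    Comparator<Map.Entry<String, Long>> bySum = Map.Entry.comparingByValue();
    // Min-heap: the head is always the weakest of the current top-k candidates.
    PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(bySum);
    for (Map.Entry<String, Long> entry : sums.entrySet()) {
      if (heap.size() < k) {
        heap.offer(entry);
      } else if (entry.getValue() > heap.peek().getValue()) {
        // The new entry beats the weakest candidate, so replace it.
        heap.poll();
        heap.offer(entry);
      }
    }
    // Heap iteration order is unspecified, so sort the survivors in descending order.
    List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
    result.sort(bySum.reversed());
    return result;
  }

  public static void main(String[] args) {
    Map<String, Long> sums = new HashMap<>();
    sums.put("a", 10L);
    sums.put("b", 50L);
    sums.put("c", 30L);
    sums.put("d", 20L);
    System.out.println(topK(sums, 2));
  }
}
```

The heap never grows beyond k entries, so the cost is O(n log k) over n groups rather than the O(n log n) of a full sort.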
**Observability stats** (zero overhead when not collected)
Added to `AggregateOperator.StatKey`:
- `ROWS_IN` (LONG) — total input rows processed by this stage
- `GENERATOR_TYPE` (STRING) — which `GroupIdGenerator` backend was selected
**Unit tests** (`pinot-query-runtime`)
- `GroupIdGeneratorTest` (21 tests): correctness harness covering null handling, limit enforcement, iterator fidelity, and factory routing for all generator variants
- `MultistageGroupByCombineLimitTest` (9 tests): top-K materialization tests, including `testLeafTrimCausesIncorrectGlobalTopK`, a **regression test that documents the known open bug** in which leaf-level LIMIT K trimming causes the intermediate stage to return the wrong global top-K
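The leaf-trim bug can be reproduced on paper with a toy model. The sketch below is not the PR's test: `LeafTrimTopKSketch`, its methods, and the sample data are hypothetical. It shows how a group that ranks second on every leaf can still have the largest global sum, and is lost when each leaf keeps only its local top-K.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical toy model of leaf-level LIMIT K trimming followed by an intermediate merge.
public class LeafTrimTopKSketch {
  /** Keeps the k largest partial sums of one leaf (ORDER BY sum DESC LIMIT k applied locally). */
  public static Map<String, Long> trim(Map<String, Long> partial, int k) {
    return partial.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(k)
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  /** Merges per-leaf partial sums, then returns the group keys of the k largest totals. */
  public static List<String> globalTopK(List<Map<String, Long>> leaves, int k) {
    Map<String, Long> merged = new HashMap<>();
    for (Map<String, Long> leaf : leaves) {
      leaf.forEach((key, sum) -> merged.merge(key, sum, Long::sum));
    }
    return merged.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(k)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Group B is second on both leaves, yet its total (18) beats A (10) and C (11).
    Map<String, Long> leaf1 = Map.of("A", 10L, "B", 9L);
    Map<String, Long> leaf2 = Map.of("C", 11L, "B", 9L);

    List<String> correct = globalTopK(List.of(leaf1, leaf2), 1);
    List<String> trimmed = globalTopK(List.of(trim(leaf1, 1), trim(leaf2, 1)), 1);

    System.out.println("without leaf trim: " + correct);
    System.out.println("with leaf trim:    " + trimmed);
  }
}
```

With K = 1, the untrimmed merge selects B, while the trimmed merge never sees B and selects C instead.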
### What's NOT changed
- No query plan changes
- No wire protocol changes
- No behaviour changes for existing queries
- All new flags/stats are off or inert by default
### Running the benchmarks
```bash
# Build perf module
./mvnw install -pl pinot-perf -am -DskipTests
# Run key-gen benchmarks (all scenarios, ~5 min)
java -jar pinot-perf/target/benchmarks.jar BenchmarkMSEGroupByKeyGen \
-f 1 -wi 3 -i 5 -tu us
# Run only single-int scenario
java -jar pinot-perf/target/benchmarks.jar BenchmarkMSEGroupByKeyGen \
-f 1 -wi 3 -i 5 -tu us \
-p scenario=SINGLE_INT
```
### Test commands
```bash
./mvnw -pl pinot-query-runtime -Dtest=GroupIdGeneratorTest test
./mvnw -pl pinot-query-runtime -Dtest=MultistageGroupByCombineLimitTest test
```
🤖 Generated with [Claude Code](https://claude.com/claude-code)