Dandandan opened a new pull request, #21816:
URL: https://github.com/apache/datafusion/pull/21816

   ## Summary
   - Opt-in `datafusion.execution.emit_aggregate_group_hash` (default 
**false**). When enabled, a Partial `AggregateExec` feeding a Hash 
`RepartitionExec` over the same group columns emits a trailing `UInt64` column 
of precomputed row hashes (seeded with `REPARTITION_RANDOM_STATE`).
   - `RepartitionExec::Hash` consumes the column directly via a new fast path, 
eliminating one full rehashing pass on the shuffle. The largest gains are 
expected on string/binary group keys (e.g. Clickbench `regexp_replace` keys).
   - New optimizer rule `EmitPartialAggregateHash` enables hash emission 
(`with_emit_group_hash(true)`) on the Partial aggregate of matching 
`Partial → Hash-Repartition` pairs. The rule is gated on the config option, so 
default behavior is unchanged.
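
The saving described above can be pictured with a small standalone sketch. It is not the PR's code: `shuffle_hash` uses the standard library's `DefaultHasher` as a stand-in for the hasher seeded with `REPARTITION_RANDOM_STATE`, the function names are hypothetical, and routing by `hash % num_partitions` is an assumption made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the shuffle hash (assumption: the real code uses a
/// different, seeded hasher). `DefaultHasher::new()` is deterministic.
fn shuffle_hash(key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Slow path: hash every group key again to pick an output partition.
fn route_by_rehash(keys: &[&str], num_partitions: u64) -> Vec<u64> {
    keys.iter()
        .map(|k| shuffle_hash(k) % num_partitions)
        .collect()
}

/// Fast path: reuse the trailing hash column and only take the modulo.
fn route_by_precomputed(hashes: &[u64], num_partitions: u64) -> Vec<u64> {
    hashes.iter().map(|h| h % num_partitions).collect()
}
```

Both paths assign each row to the same partition as long as the precomputed column was produced with the same hash function and seed; the fast path simply skips reading and hashing the (possibly long) string keys a second time.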
   
   ## What the rule does
   ```
   Partial AggregateExec (emits __datafusion_precomputed_hash)
     └── RepartitionExec: Hash([...group cols...], N)   ← uses precomputed hash
           └── ...
   ```
   - Hash column sits at the end of Partial output, tagged with field metadata 
(`datafusion.precomputed_hash = "repartition_seed_0"` plus a 
`datafusion.precomputed_hash_cols` CSV of source indices).
   - Repartition matches either "partitioning expr IS the hash column" or 
"partitioning exprs are the recorded source columns, in order".
   - Final's indexing into group/state columns is unaffected (hash sits after 
states); its own output schema has no hash column.
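
The two matching cases in the list above can be sketched as follows. This is a simplified model, not the PR's `detect_precomputed_hash_column`: the `Field` struct and the function name are hypothetical, columns are represented by plain indices, and only the metadata keys and the `"repartition_seed_0"` tag are taken from the description.

```rust
use std::collections::HashMap;

/// Metadata keys quoted from the PR description.
const HASH_KEY: &str = "datafusion.precomputed_hash";
const HASH_COLS_KEY: &str = "datafusion.precomputed_hash_cols";

/// Hypothetical, simplified field: only the metadata map matters here.
struct Field {
    metadata: HashMap<String, String>,
}

/// Returns the index of a usable precomputed hash column, if any.
/// `partition_cols` holds the column indices the repartition hashes on.
fn detect_hash_column_sketch(fields: &[Field], partition_cols: &[usize]) -> Option<usize> {
    // The hash column is emitted last, so only the trailing field is inspected.
    let hash_idx = fields.len().checked_sub(1)?;
    let meta = &fields[hash_idx].metadata;
    if meta.get(HASH_KEY).map(String::as_str) != Some("repartition_seed_0") {
        return None;
    }
    // Case (a): the partitioning expression is the hash column itself.
    if partition_cols.len() == 1 && partition_cols[0] == hash_idx {
        return Some(hash_idx);
    }
    // Case (b): the partitioning columns equal the recorded source columns,
    // in the same order. Subsets and reorderings do not match.
    let recorded = meta
        .get(HASH_COLS_KEY)?
        .split(',')
        .map(|s| s.trim().parse::<usize>().ok())
        .collect::<Option<Vec<usize>>>()?;
    if recorded.as_slice() == partition_cols {
        Some(hash_idx)
    } else {
        None
    }
}
```

Requiring an exact, ordered match is the conservative choice: a hash over columns `(0, 1)` generally differs from a hash over `(1, 0)` or over `(0)` alone, so reusing it for those partitionings would route rows to the wrong partition.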
   
   ## Out of scope / follow-ups
   - Reusing the hash in FinalPartitioned `GroupValues`. This would need a 
`GroupValues` trait extension plus a second hash column seeded with 
`AGGREGATION_HASH_SEED`, to preserve the intentional seed difference between 
the shuffle hash and the agg-side probe hash.
   - Benchmarks on this PR will tell us whether the Repartition-side saving 
alone moves the needle on Clickbench before we invest in the GroupValues work.
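
One plausible reason for keeping the two seeds apart, offered here as an assumption rather than something the description states, is that rows arriving in a given partition all share the same shuffle hash modulo the partition count. If the aggregation hash table reused that hash, those rows could only reach a fraction of its buckets. The sketch below models this with a hypothetical seeded mixer (a SplitMix64-style finalizer, not DataFusion's hasher) and a table indexed by `hash % num_buckets`.

```rust
use std::collections::BTreeSet;

/// Hypothetical seeded mixer (SplitMix64-style finalizer).
fn seeded_hash(key: u64, seed: u64) -> u64 {
    let mut z = key.wrapping_add(seed).wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Keys that the shuffle routes to `partition` out of `num_partitions`.
fn keys_in_partition(
    keys: &[u64],
    shuffle_seed: u64,
    partition: u64,
    num_partitions: u64,
) -> Vec<u64> {
    keys.iter()
        .copied()
        .filter(|&k| seeded_hash(k, shuffle_seed) % num_partitions == partition)
        .collect()
}

/// Number of distinct buckets the keys occupy in a table of `num_buckets` slots.
fn buckets_used(keys: &[u64], probe_seed: u64, num_buckets: u64) -> usize {
    let buckets: BTreeSet<u64> = keys
        .iter()
        .map(|&k| seeded_hash(k, probe_seed) % num_buckets)
        .collect();
    buckets.len()
}
```

With 4 partitions and a 16-bucket table, probing with the shuffle seed can touch at most 4 buckets for the rows of any one partition, because every such hash is congruent modulo 4. Probing with a different seed removes that structural constraint in this model.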
   
   ## Test plan
   - [x] Unit: Partial emits the tagged column, values match 
`create_hashes(&group_arrays, REPARTITION_RANDOM_STATE)`
   - [x] Unit: `with_emit_group_hash(true)` is a no-op on non-Partial modes
   - [x] Unit: `detect_precomputed_hash_column` matches (a) 
hash-column-as-partitioning-expr and (b) multi-column group with source 
indices, and rejects subset/reordering
   - [x] Unit: optimizer rule fires only when config is enabled and groups 
match; skips when partitioning expr is non-Column
   - [x] `cargo fmt --all` + `cargo clippy -p datafusion-physical-plan -p 
datafusion-physical-optimizer -p datafusion-common --all-targets -- -D warnings`
   - [ ] Benchmarks (pending — see comment)
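
The optimizer-rule cases in the test plan (fires only when enabled and the groups match, skips non-Column partitioning expressions) can be modeled with a toy plan type. Everything in this sketch is hypothetical: the `Plan` and `Expr` enums, the field names, and the function name are illustrative stand-ins, the rule only inspects the top of the plan, and `input` denotes the upstream operator that produces the rows.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(usize),
    Other(String), // any non-column expression, e.g. "a + 1"
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    PartialAggregate { group_cols: Vec<usize>, emit_group_hash: bool, input: Box<Plan> },
    HashRepartition { exprs: Vec<Expr>, partitions: usize, input: Box<Plan> },
    Scan,
}

/// Sets `emit_group_hash` on a Partial aggregate that feeds a hash
/// repartition over exactly the same group columns, if `enabled`.
fn emit_partial_aggregate_hash_sketch(plan: Plan, enabled: bool) -> Plan {
    match plan {
        Plan::HashRepartition { exprs, partitions, input } => {
            let input = match *input {
                Plan::PartialAggregate { group_cols, emit_group_hash, input: child } => {
                    // Collect column indices; any non-column expression yields None.
                    let cols: Option<Vec<usize>> = exprs
                        .iter()
                        .map(|e| match e {
                            Expr::Column(i) => Some(*i),
                            Expr::Other(_) => None,
                        })
                        .collect();
                    let matches = enabled && cols.as_deref() == Some(group_cols.as_slice());
                    Plan::PartialAggregate {
                        group_cols,
                        emit_group_hash: emit_group_hash || matches,
                        input: child,
                    }
                }
                other => other,
            };
            Plan::HashRepartition { exprs, partitions, input: Box::new(input) }
        }
        other => other,
    }
}
```

When the config is disabled the function returns the plan unchanged, which mirrors the "default behavior is unchanged" guarantee in the summary.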
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

