Kevin-Li-2025 opened a new pull request, #23065:
URL: https://github.com/apache/datafusion/pull/23065

   ## Which issue does this PR close?
   
   Follow-up to #23043. This draft is stacked on that PR and should be reviewed after it merges.
   
   ## Rationale for this change
   
   Grouped `any_value` currently falls back to `GroupsAccumulatorAdapter`, 
which creates one boxed `Accumulator` per group, collects row indices, 
materializes per-group slices, and performs dynamic dispatch for every group. 
That overhead dominates high-cardinality `GROUP BY` workloads even though 
`any_value` only needs to retain one non-null value per group.
   
   This PR adds a native `GroupsAccumulator` that:
   
   - stores one `ScalarValue` and one `is_set` bit per group;
   - scans each input batch once and stops updating a group after its first 
valid value;
   - preserves the existing two-column partial-state contract;
   - supports filters, state merging, `EmitTo::First`, and `convert_to_state`; 
and
   - works for every Arrow type accepted by `any_value`.
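
The per-group state described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: `AnyValueGroups`, `emit_first`, and the slice-based signatures are hypothetical stand-ins for the `GroupsAccumulator` trait methods, a generic `T` replaces `ScalarValue`, and the `(value, is_set)` layout of the partial state is an assumption.

```rust
/// Hypothetical, simplified stand-in for a native grouped `any_value` accumulator.
/// It keeps one value slot and one `is_set` flag per group.
#[derive(Debug, Default)]
pub struct AnyValueGroups<T> {
    values: Vec<T>,
    is_set: Vec<bool>,
}

impl<T: Clone + Default> AnyValueGroups<T> {
    pub fn new() -> Self {
        Self { values: Vec::new(), is_set: Vec::new() }
    }

    /// Grow the per-group state so that every group index below `total_groups` is valid.
    fn resize(&mut self, total_groups: usize) {
        if self.values.len() < total_groups {
            self.values.resize(total_groups, T::default());
            self.is_set.resize(total_groups, false);
        }
    }

    /// Single pass over one batch: row `i` belongs to group `group_indices[i]`.
    /// A group is skipped as soon as it already holds a value.
    pub fn update_batch(
        &mut self,
        values: &[Option<T>],
        group_indices: &[usize],
        filter: Option<&[bool]>,
        total_groups: usize,
    ) {
        assert_eq!(values.len(), group_indices.len());
        self.resize(total_groups);
        for (row, (value, &group)) in values.iter().zip(group_indices).enumerate() {
            if self.is_set[group] {
                continue; // first valid value already retained
            }
            if filter.map_or(false, |f| !f[row]) {
                continue; // row rejected by the filter
            }
            if let Some(v) = value {
                self.values[group] = v.clone();
                self.is_set[group] = true;
            }
        }
    }

    /// Merge partial state, assumed here to be a `(value, is_set)` column pair.
    pub fn merge_batch(
        &mut self,
        state_values: &[T],
        state_is_set: &[bool],
        group_indices: &[usize],
        total_groups: usize,
    ) {
        self.resize(total_groups);
        for ((v, &set), &group) in state_values.iter().zip(state_is_set).zip(group_indices) {
            if set && !self.is_set[group] {
                self.values[group] = v.clone();
                self.is_set[group] = true;
            }
        }
    }

    /// Emit the first `n` groups (in the spirit of `EmitTo::First(n)`);
    /// the remaining groups shift down to start at index 0.
    pub fn emit_first(&mut self, n: usize) -> Vec<Option<T>> {
        let vals: Vec<T> = self.values.drain(..n).collect();
        let set: Vec<bool> = self.is_set.drain(..n).collect();
        vals.into_iter()
            .zip(set)
            .map(|(v, s)| if s { Some(v) } else { None })
            .collect()
    }
}
```

In this sketch a group that never sees a valid, unfiltered row is emitted as `None`, and no per-group boxed accumulator or row-index list is allocated.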
   
   ## What changes are included?
   
   - Native grouped accumulator implementation and four focused unit tests.
   - Criterion benchmark comparing the native path with 
`GroupsAccumulatorAdapter` for Int64 and Utf8 at 8,192 rows / 4,096 groups.
   
   Local Apple Silicon benchmark medians:
   
   | Type | Native | Adapter | Improvement |
   | --- | ---: | ---: | ---: |
   | Int64 | 0.245 ms | 4.92 ms | ~20x |
   | Utf8 | 0.515 ms | 12-13 ms | >10x |
   
   The Utf8 adapter result is allocator-sensitive, so the claim is 
intentionally conservative.
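
As a rough check, the ratios implied by the medians in the table work out to approximately:

```math
\frac{4.92}{0.245} \approx 20.1, \qquad \frac{12}{0.515} \approx 23.3, \qquad \frac{13}{0.515} \approx 25.2
```

The Utf8 figures sit well above the stated `>10x` bound, which is the sense in which that claim is conservative.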
   
   ## Validation
   
   - `cargo test -p datafusion-functions-aggregate --lib` (146 passed)
   - `cargo clippy -p datafusion-functions-aggregate --all-targets -- -D warnings`
   - `cargo fmt --all -- --check`
   - `cargo bench -p datafusion-functions-aggregate --bench any_value -- --noplot`
   - `git diff --check`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]