zhuqi-lucas opened a new pull request, #22751:
URL: https://github.com/apache/datafusion/pull/22751

   # Which issue does this PR close?
   
   PR 1 of 5 from the split agreed on 
https://github.com/apache/datafusion/pull/22706#issuecomment-4602892722. 
Related to #22682 / #22715 (full `GroupValuesColumn` type coverage). Closes 
nothing on its own; lays the dispatcher foundation.
   
   # Rationale for this change
   
   Preparation PR for the nested-type `GroupColumn` work. No new builders here. 
The goal is to refactor the dispatch in `GroupValuesColumn::intern` so 
subsequent PRs that add `FixedSizeList` / `Struct` / `List` / `LargeList` 
support can plug into one well-defined factory instead of growing the inline 
match in `intern`. Also includes two adjacent correctness fixes around the 
`Time32` / `Time64` variants that came up in the upstream thread.
   
   # What changes are included in this PR?
   
   1. **Factory extraction.** The inline match that maps each schema field to a 
`Box<dyn GroupColumn>` builder moves out of `GroupValuesColumn::intern` into a 
free function `make_group_column(field: &Field) -> Result<Box<dyn 
GroupColumn>>`. `intern` becomes a one-liner loop. Future nested-type 
specializations recursively call this factory for child field construction 
without enumerating every combination inline.
   
   2. **Time32 / Time64 supported_type alignment.** Previously `supported_type` 
matched `Time32(_)` (admitting the invalid Microsecond / Nanosecond 
combinations) and did not match `Time64(_)` at all, while the dispatcher 
accepted `Time32(Second / Millisecond)` and `Time64(Microsecond / Nanosecond)`. 
Tighten `supported_type` to the exact set the dispatcher constructs. The 
dispatcher's wildcard arms for invalid `Time` variants now return 
`not_impl_err` instead of silently doing nothing, so a schema that somehow 
reaches the dispatcher fails loudly rather than producing an empty builder 
vector.
   
   3. **`supported_type` ↔ `make_group_column` consistency fuzz.** New unit 
test `supported_type_and_make_group_column_stay_in_sync` iterates a 
representative set of 20 supported and 6 unsupported `DataType`s and asserts 
the biconditional. Pins the alignment so future contributors who add a type to 
one side without the other trip a unit test immediately.
   
   4. **dhat heap regression harness behind a new `dhat-heap` feature.** Adds 
an optional `dhat = "0.3"` dependency, a `dhat-heap` feature in 
`physical-plan/Cargo.toml`, a feature-gated global allocator in `lib.rs`, and a 
`dhat_tests` submodule that wraps a representative `GroupValuesColumn::intern` 
workload in `dhat::Profiler` and asserts peak heap stays under a 1 MiB budget 
on a 1k-row 2-col workload. Default `cargo test` does not compile or run dhat. 
Subsequent type-support PRs should tighten the budget to reflect their savings.
   
   # Are these changes tested?
   
   - `cargo test -p datafusion-physical-plan --lib aggregates::group_values` 
passes 30 tests (27 existing + 3 new: the consistency fuzz, the mixed-schema 
rejection, and the NotImpl propagation).
   - `cargo test -p datafusion-physical-plan --lib aggregates` passes 105 tests 
(no regression in the broader aggregate suite).
   - `cargo test -p datafusion-physical-plan --features dhat-heap 
aggregates::group_values::multi_group_by::dhat_tests` passes 1 test with peak 
heap well under the 1 MiB budget.
   - `cargo clippy -p datafusion-physical-plan --lib --tests --features 
dhat-heap -- -D warnings` is clean.
   
   # Are there any user-facing changes?
   
   No semantic change for any schema that already used `GroupValuesColumn` on 
main. The factory is the same dispatch logic pulled into a function; the `Time` 
changes only affect schemas that are semantically invalid Arrow types anyway.
   
   The `dhat-heap` feature is new but off by default, so consumers that do not 
enable it pay no overhead.
   
   # What this PR is NOT doing
   
   - No new `GroupColumn` builders. `FixedSizeList`, `Struct`, `List`, 
`LargeList` come in subsequent PRs of the #22682 sequence.
   - No new types in `supported_type`'s allow-list (the Time tightening removes 
invalid combinations rather than adding new ones).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to