adriangb opened a new pull request, #21984:
URL: https://github.com/apache/datafusion/pull/21984

   ## Which issue does this PR close?
   
   - Closes #21982.
   
   ## Rationale for this change
   
   Several built-in UDFs and UDAFs that wrap an input column into a composite 
type (`List`, `Struct`, `Map`, …) drop the input field's metadata when 
constructing the output's inner Field. In practice this matters most for Arrow 
extension types (`ARROW:extension:name` / `ARROW:extension:metadata` — e.g. 
`arrow.json`, `arrow.uuid`), because SQL-constructed lists/structs/maps 
silently lose extension-type identity. Any downstream operation that compares 
the produced `DataType` against a type carrying inner-field metadata (union, 
aggregate merging, IPC roundtrip) sees them as different types.
   
   Each affected function has the same shape: only `return_type` 
(DataType-only) is overridden, and the runtime path synthesizes a fresh inner 
`Field` with no metadata. The fix is therefore systematic rather than 
per-function.
   
   ## What changes are included in this PR?
   
   Two new helpers in `datafusion-common::utils`:
   
   - `list_inner_field_from(&Field) -> FieldRef` — builds the canonical inner 
field of a `List`/`LargeList`/`FixedSizeList` from a source field, preserving 
the source's data type and metadata.
   - `struct_inner_fields_from(...) -> Fields` — builds named struct member 
fields from a sequence of `(name, &Field)` pairs, preserving each input's 
metadata.
   
   `SingleRowListArrayBuilder::with_field` was extended to also propagate 
metadata.
   
   These helpers are then used consistently from each affected function's 
`return_field_from_args` (UDF) / `return_field` (UDAF), and the 
metadata-bearing inner field is threaded into the runtime construction paths:
   
   | Function | Change |
   | --- | --- |
   | `make_array` | new `return_field_from_args`; runtime uses 
`args.return_field`'s inner field |
   | `array_agg` (incl. distinct, ordered, groups accumulator) | new 
`return_field` and `state_fields`; accumulators now carry a `FieldRef` instead 
of a bare `DataType` and use `list_inner_field_from` at every list-construction 
site |
   | `array_repeat` | new `return_field_from_args`; runtime threads inner field 
through |
   | `arrays_zip` | new `return_field_from_args` (preserves struct member 
metadata); runtime uses planning-time struct fields |
   | `map` | new `return_field_from_args`; `entries` field threaded through 
`make_map_array_internal` / `make_map_array_from_fixed_size_list` / 
`build_map_array` |
   | `range` / `generate_series` | new `return_field_from_args`; runtime grafts 
the planning-time inner field onto results from `ListBuilder`-based paths |
   | `struct` | new `return_field_from_args` (preserves member field metadata) |
   
   `spark.collect_list` / `collect_set` (which reuse `ArrayAggAccumulator` / 
`DistinctArrayAggAccumulator`) were updated to pass the input `FieldRef` 
through.
   
   ## Are these changes tested?
   
   Yes:
   
   - Rust unit tests in each affected file directly assert metadata propagation 
through `return_field_from_args` / `return_field` (e.g. 
`make_array_preserves_inner_field_metadata`, 
`array_agg_preserves_inner_field_metadata`, `struct_preserves_member_metadata`, 
`arrays_zip_preserves_struct_member_metadata`, 
`array_repeat_preserves_inner_field_metadata`, 
`map_preserves_key_value_field_metadata`, 
`range_preserves_inner_field_metadata`).
   - New end-to-end SLT 
`datafusion/sqllogictest/test_files/array_metadata_propagation.slt` exercises 
every affected constructor through `with_metadata` → wrapping function → 
`arrow_field(...)['data_type']`, asserting the rendered data type contains the 
expected inner-field metadata. This covers `make_array`, `array_repeat`, 
`range`, `generate_series`, `struct`, `arrays_zip`, `map`, and `array_agg` 
(default, `DISTINCT`, and `ORDER BY` paths).
   - All existing tests in `datafusion-common`, `datafusion-functions`, 
`datafusion-functions-aggregate`, `datafusion-functions-nested` continue to 
pass; the full SLT suite (465 files) passes; clippy is clean.
   
   ## Are there any user-facing changes?
   
   Yes — but they are bug fixes rather than breaking changes: SQL-constructed 
lists/structs/maps now retain Arrow extension-type identity from their input 
fields. The accumulator constructors (`ArrayAggAccumulator::try_new`, 
`DistinctArrayAggAccumulator::try_new`, 
`OrderSensitiveArrayAggAccumulator::try_new`, `ArrayAggGroupsAccumulator::new`) 
now take `&FieldRef` instead of `&DataType`; this is a `pub` API change for 
downstream code that constructs these accumulators directly.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to