adriangb opened a new pull request, #21984: URL: https://github.com/apache/datafusion/pull/21984
## Which issue does this PR close? - Closes #21982. ## Rationale for this change Several built-in UDFs and UDAFs that wrap an input column into a composite type (`List`, `Struct`, `Map`, …) drop the input field's metadata when constructing the output's inner Field. In practice this matters most for Arrow extension types (`ARROW:extension:name` / `ARROW:extension:metadata` — e.g. `arrow.json`, `arrow.uuid`), because SQL-constructed lists/structs/maps silently lose extension-type identity. Any downstream operation that compares the produced `DataType` against a type carrying inner-field metadata (union, aggregate merging, IPC roundtrip) sees them as different types. Each affected function has the same shape: only `return_type` (DataType-only) is overridden, and the runtime path synthesizes a fresh inner `Field` with no metadata. The fix is therefore systematic rather than per-function. ## What changes are included in this PR? Two new helpers in `datafusion-common::utils`: - `list_inner_field_from(&Field) -> FieldRef` — builds the canonical inner field of a `List`/`LargeList`/`FixedSizeList` from a source field, preserving the source's data type and metadata. - `struct_inner_fields_from(...) -> Fields` — builds named struct member fields from a sequence of `(name, &Field)` pairs, preserving each input's metadata. `SingleRowListArrayBuilder::with_field` was extended to also propagate metadata. These helpers are then used consistently from each affected function's `return_field_from_args` (UDF) / `return_field` (UDAF), and the metadata-bearing inner field is threaded into the runtime construction paths: | Function | Change | | --- | --- | | `make_array` | new `return_field_from_args`; runtime uses `args.return_field`'s inner field | | `array_agg` (incl. distinct, ordered, groups accumulator) | new `return_field` and `state_fields`; accumulators now carry a `FieldRef` instead of a bare `DataType` and use `list_inner_field_from` at every list-construction site | | `array_repeat` | new `return_field_from_args`; runtime threads inner field through | | `arrays_zip` | new `return_field_from_args` (preserves struct member metadata); runtime uses planning-time struct fields | | `map` | new `return_field_from_args`; `entries` field threaded through `make_map_array_internal` / `make_map_array_from_fixed_size_list` / `build_map_array` | | `range` / `generate_series` | new `return_field_from_args`; runtime grafts the planning-time inner field onto results from `ListBuilder`-based paths | | `struct` | new `return_field_from_args` (preserves member field metadata) | `spark.collect_list` / `collect_set` (which reuse `ArrayAggAccumulator` / `DistinctArrayAggAccumulator`) were updated to pass the input `FieldRef` through. ## Are these changes tested? Yes: - Rust unit tests in each affected file directly assert metadata propagation through `return_field_from_args` / `return_field` (e.g. `make_array_preserves_inner_field_metadata`, `array_agg_preserves_inner_field_metadata`, `struct_preserves_member_metadata`, `arrays_zip_preserves_struct_member_metadata`, `array_repeat_preserves_inner_field_metadata`, `map_preserves_key_value_field_metadata`, `range_preserves_inner_field_metadata`). - New end-to-end SLT `datafusion/sqllogictest/test_files/array_metadata_propagation.slt` exercises every affected constructor through `with_metadata` → wrapping function → `arrow_field(...)['data_type']`, asserting the rendered data type contains the expected inner-field metadata. This covers `make_array`, `array_repeat`, `range`, `generate_series`, `struct`, `arrays_zip`, `map`, and `array_agg` (default, `DISTINCT`, and `ORDER BY` paths). - All existing tests in `datafusion-common`, `datafusion-functions`, `datafusion-functions-aggregate`, `datafusion-functions-nested` continue to pass; the full SLT suite (465 files) passes; clippy is clean. ## Are there any user-facing changes? Yes — but they are bug fixes rather than breaking changes: SQL-constructed lists/structs/maps now retain Arrow extension-type identity from their input fields. The accumulator constructors (`ArrayAggAccumulator::try_new`, `DistinctArrayAggAccumulator::try_new`, `OrderSensitiveArrayAggAccumulator::try_new`, `ArrayAggGroupsAccumulator::new`) now take `&FieldRef` instead of `&DataType`; this is a `pub` API change for downstream code that constructs these accumulators directly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
