adriangb opened a new issue, #21982:
URL: https://github.com/apache/datafusion/issues/21982

   ### Describe the bug
   
   `make_array(col)` and `array_agg(col)` both return a `List<T>` whose 
inner-Field metadata is empty, even when the source column's `Field` carries 
metadata (e.g. `ARROW:extension:name` / `ARROW:extension:metadata` for an Arrow 
extension type such as `arrow.json`, `arrow.uuid`, etc.).
   
   Concretely, this means SQL-constructed lists silently drop Arrow 
extension-type identity, and any downstream operation that compares the 
produced `DataType::List(field)` against a list type carrying inner-field 
metadata (e.g. union, aggregate merging, IPC roundtrip) will see them as 
different types.
   
   ### Root cause
   
   Both functions implement only the `return_type` hook, which receives argument 
data types rather than input fields, so they have no way to read source metadata:
   
   - `datafusion/functions-nested/src/make_array.rs:97-106` — `return_type` 
returns `DataType::new_list(element_type, true)`.
   - `datafusion/functions-aggregate/src/array_agg.rs:107-112` — `return_type` 
returns `DataType::List(Arc::new(Field::new_list_field(arg_types[0].clone(), 
true)))`.
   
   The runtime paths construct fresh fields with no metadata as well:
   
   - `array_array` in `make_array.rs:237` builds 
`Arc::new(Field::new(field_name, data_type, true))`.
   - The various `array_agg` accumulators (`array_agg.rs:663`, `:734`, etc.) 
build `Arc::new(Field::new_list_field(self.datatype.clone(), true))`.
   
   The UDF hook `return_field_from_args` and the UDAF hook `return_field` both 
accept input `FieldRef`s and could propagate metadata, but `MakeArray` and 
`ArrayAgg` do not override them.
   
   ### To Reproduce
   
   Write a parquet file whose scalar column carries field-level metadata, then 
aggregate it from SQL.
   
   ```python
   # pip install pyarrow
   import os
   
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Attach extension-type metadata to a plain string field.
   field = pa.field("s", pa.string(), nullable=True, metadata={
       b"ARROW:extension:name": b"arrow.json",
       b"ARROW:extension:metadata": b"{}",
   })
   schema = pa.schema([field])
   batch = pa.record_batch([pa.array(["alpha", "beta"])], schema=schema)
   
   # The output directory must exist before write_table is called.
   os.makedirs("/tmp/scalar_meta", exist_ok=True)
   pq.write_table(pa.Table.from_batches([batch]),
                  "/tmp/scalar_meta/input.parquet")
   ```
   
   ```sql
   -- /tmp/q.sql
   CREATE EXTERNAL TABLE src
   STORED AS PARQUET
   LOCATION '/tmp/scalar_meta/input.parquet';
   
   COPY (SELECT make_array(s) AS m FROM src) TO 
'/tmp/scalar_meta/make_array.parquet' STORED AS PARQUET;
   COPY (SELECT array_agg(s)  AS m FROM src) TO 
'/tmp/scalar_meta/array_agg.parquet'  STORED AS PARQUET;
   ```
   
   Run with `cargo run -p datafusion-cli -- -f /tmp/q.sql`, then inspect the 
outputs:
   
   ```python
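   import pyarrow.parquet as pq
   
   # Print the metadata on the inner (element) field of each output list column.
   for path in ["/tmp/scalar_meta/make_array.parquet",
                "/tmp/scalar_meta/array_agg.parquet"]:
       f = pq.read_table(path).schema.field(0)
       print(path, "inner metadata:", f.type.field(0).metadata)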
   ```
   
   ### Expected behavior
   
   Inner-field metadata of the output `List<T>` should be propagated from the 
source field. For the repro above:
   
   ```
   inner metadata: {b'ARROW:extension:name': b'arrow.json', 
b'ARROW:extension:metadata': b'{}'}
   ```
   
   ### Actual output
   
   ```
   /tmp/scalar_meta/make_array.parquet inner metadata: None
   /tmp/scalar_meta/array_agg.parquet  inner metadata: None
   ```
   
   ### Suggested fix
   
   - `MakeArray`: override `return_field_from_args` so the returned `List` 
field carries an inner field cloned (metadata + nullability) from 
`arg_fields[0]`. Thread that field through `invoke_with_args` so `array_array` 
constructs the runtime `ListArray` with the same inner field, instead of 
synthesizing one from a `&str` name and `DataType` alone.
   - `ArrayAgg`: override `return_field` similarly. The single-row, group-by, 
ordered, and distinct accumulators each construct their output list field from 
`self.datatype` — they need to use a stored `FieldRef` that preserves the 
input's metadata.
   
   ### Additional context
   
   This is closely related to #21981 (`MinAccumulator` over `List<T>` drops 
inner-Field metadata for non-null row groups) — both stem from list 
constructors and accumulators that throw away input-field metadata when 
synthesizing output fields. Fixing the propagation here would also let users 
write a pure-SQL repro for #21981 starting from a scalar-metadata parquet 
column instead of a `List<T>`-metadata one.
   
   Verified on `main` at f0430920c.
   
   ### Component(s)
   
   `functions-nested`, `functions-aggregate`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

