adriangb opened a new issue, #21982:
URL: https://github.com/apache/datafusion/issues/21982
### Describe the bug
`make_array(col)` and `array_agg(col)` both return a `List<T>` whose
inner-Field metadata is empty, even when the source column's `Field` carries
metadata (e.g. `ARROW:extension:name` / `ARROW:extension:metadata` for an Arrow
extension type such as `arrow.json`, `arrow.uuid`, etc.).
Concretely, this means SQL-constructed lists silently drop Arrow
extension-type identity, and any downstream operation that compares the
produced `DataType::List(field)` against a list type carrying inner-field
metadata (e.g. union, aggregate merging, IPC roundtrip) will see them as
different types.
### Root cause
Both functions implement only the type-only `return_type` hook and never
look at input fields, so they have no way to read source metadata:
- `datafusion/functions-nested/src/make_array.rs:97-106` — `return_type`
returns `DataType::new_list(element_type, true)`.
- `datafusion/functions-aggregate/src/array_agg.rs:107-112` — `return_type`
returns `DataType::List(Arc::new(Field::new_list_field(arg_types[0].clone(),
true)))`.
The runtime paths construct fresh fields with no metadata as well:
- `array_array` in `make_array.rs:237` builds
`Arc::new(Field::new(field_name, data_type, true))`.
- The various `array_agg` accumulators (`array_agg.rs:663`, `:734`, etc.)
build `Arc::new(Field::new_list_field(self.datatype.clone(), true))`.
Both UDF (`return_field_from_args`) and UDAF (`return_field`) hooks accept
input `FieldRef`s and could propagate metadata, but neither function overrides
them.
### To Reproduce
Write a parquet file whose scalar column carries field-level metadata, then
aggregate it from SQL.
```python
# pip install pyarrow
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Scalar string column whose Field carries Arrow extension-type metadata.
field = pa.field("s", pa.string(), nullable=True, metadata={
    b"ARROW:extension:name": b"arrow.json",
    b"ARROW:extension:metadata": b"{}",
})
schema = pa.schema([field])
batch = pa.record_batch([pa.array(["alpha", "beta"])], schema=schema)

# The output directory must exist before writing.
os.makedirs("/tmp/scalar_meta", exist_ok=True)
pq.write_table(pa.Table.from_batches([batch]),
               "/tmp/scalar_meta/input.parquet")
```
```sql
-- /tmp/q.sql
CREATE EXTERNAL TABLE src
STORED AS PARQUET
LOCATION '/tmp/scalar_meta/input.parquet';
COPY (SELECT make_array(s) AS m FROM src) TO
'/tmp/scalar_meta/make_array.parquet' STORED AS PARQUET;
COPY (SELECT array_agg(s) AS m FROM src) TO
'/tmp/scalar_meta/array_agg.parquet' STORED AS PARQUET;
```
Run with `cargo run -p datafusion-cli -- -f /tmp/q.sql`, then inspect the
outputs:
```python
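import pyarrow.parquet as pq

for path in ["/tmp/scalar_meta/make_array.parquet",
             "/tmp/scalar_meta/array_agg.parquet"]:
    # Field 0 is the list column `m`; its child is the list's inner field.
    f = pq.read_table(path).schema.field(0)
    print(path, "inner metadata:", f.type.field(0).metadata)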
```
### Expected behavior
Inner-field metadata of the output `List<T>` should be propagated from the
source field. For the repro above:
```
inner metadata: {b'ARROW:extension:name': b'arrow.json',
b'ARROW:extension:metadata': b'{}'}
```
### Actual output
```
/tmp/scalar_meta/make_array.parquet inner metadata: None
/tmp/scalar_meta/array_agg.parquet inner metadata: None
```
### Suggested fix
- `MakeArray`: override `return_field_from_args` so the returned `List`
field carries an inner field cloned (metadata + nullability) from
`arg_fields[0]`. Thread that field through `invoke_with_args` so `array_array`
constructs the runtime `ListArray` with the same inner field, instead of
synthesizing one from a `&str` name and `DataType` alone.
- `ArrayAgg`: override `return_field` similarly. The single-row, group-by,
ordered, and distinct accumulators each construct their output list field from
`self.datatype` — they need to use a stored `FieldRef` that preserves the
input's metadata.
### Additional context
This is closely related to #21981 (`MinAccumulator` over `List<T>` drops
inner-Field metadata for non-null row groups) — both stem from list
constructors and accumulators that throw away input-field metadata when
synthesizing output fields. Fixing the propagation here would also let users
write a pure-SQL repro for #21981 starting from a scalar-metadata parquet
column instead of a `List<T>`-metadata one.
Verified on `main` at f0430920c.
### Component(s)
`functions-nested`, `functions-aggregate`