andygrove commented on issue #1287:
URL:
https://github.com/apache/datafusion-comet/issues/1287#issuecomment-4603587196
This appears to be resolved. The codebase adopted the first remedy proposed
here ("Always unpack string dictionaries during ScanExec"):
- `ScanExec::new` now builds its schema from unpacked data types (`scan.rs`:
*"Build schema directly from data types since get_next now always unpacks
dictionaries"*) via `schema_from_data_types` → `unpack_dictionary_type`.
- `ScanExec::get_next` unconditionally unpacks each array
(`copy_or_unpack_array(..., UnpackOrClone)`).
So the partial stage no longer emits dictionary types into aggregation, and
the partial/final stages now agree on the plain (e.g. `Utf8`) type. The
concrete reproducer (`max` on a dictionary-encoded string column) no longer
applies. This was further hardened by #2635, which made the fuzz generator
produce dictionary-encoded string arrays and fixed the bugs that exposed,
giving us regression coverage for this class of mismatch.
One caveat for completeness: the unpacking is top-level only
(`copy_or_unpack_array` / `unpack_dictionary_type` match a
`DataType::Dictionary` at the column root), so a dictionary nested inside a
`List`/`Struct` would not be unpacked. In practice Comet's Parquet/exchange
path dictionary-encodes at the top level, so this is theoretical. If anyone can
produce a nested-dictionary input that reaches aggregation, that's worth a
fresh, narrowly-scoped issue.
Closing as resolved by #2635. Please reopen if the mismatch is still
reproducible.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]