kumarUjjawal opened a new issue, #21496:
URL: https://github.com/apache/datafusion/issues/21496
### Is your feature request related to a problem or challenge?
After PR #21455, DataFrame::describe() no longer crashes on binary-like columns
(Binary, LargeBinary, BinaryView, FixedSizeBinary), but it now returns null for
min and max on those columns.
That fix avoids an unsafe Cast(Binary → Utf8), but it leaves users with no
way to see the value range of a binary column in describe().
For columns that store hashes, UUIDs, content-addressed identifiers, or
fingerprints (common uses of FixedSizeBinary(16) / FixedSizeBinary(32)), knowing
the min/max value is genuinely useful for sanity-checking data.
The Utf8 cast is the wrong tool for this. Arrow correctly refuses to cast
arbitrary bytes to Utf8 because there is no general lossless mapping. But that
doesn't mean we have to give up on min/max for binary; we can render the bytes
as hex instead.
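
To illustrate why hex works where a Utf8 cast cannot, here is a minimal std-only sketch. The helper names `to_hex` and `try_utf8` are hypothetical and are not part of Arrow or DataFusion; they only mirror the idea that UTF-8 decoding can fail on arbitrary bytes while hex encoding is total.

```rust
use std::fmt::Write;

/// Render arbitrary bytes as lowercase hex, two digits per byte.
/// Hypothetical helper; this is defined for every possible input.
fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for byte in bytes {
        // Writing into a String cannot fail, so unwrap is safe here.
        write!(out, "{byte:02x}").unwrap();
    }
    out
}

/// Try to view bytes as UTF-8 text. Returns None for byte sequences
/// that are not valid UTF-8, which is the case a Utf8 cast cannot handle.
fn try_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

For example, `try_utf8(&[0xff, 0xfe])` yields `None`, while `to_hex(&[0xff, 0xfe])` yields `"fffe"`.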
### Describe the solution you'd like
In DataFrame::describe():
1. Stop filtering Binary, LargeBinary, BinaryView out of the min/max
aggregations. These types are already supported by MinMaxBytesAccumulator, so
min(col) and max(col) produce a real binary scalar.
2. At the display step, special-case binary columns: instead of
cast(column, &DataType::Utf8), use Arrow's ArrayFormatter (or DisplayIndex),
which already renders these arrays as lowercase hex (it writes each byte via
{byte:02x}).
3. Update the describe schema so binary columns map to Utf8 output
containing the hex string (they already do — only the projection back into the
output column needs to change).
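
The steps above can be sketched without any DataFusion or Arrow dependency. This is only an illustration under stated assumptions: a `&[Option<Vec<u8>>]` slice stands in for a nullable binary column, `min_max_bytes` stands in for the min/max aggregation (assuming byte-wise lexicographic ordering), and `describe_min_max` stands in for the display step. All three names, plus `to_hex`, are hypothetical.

```rust
/// Lowercase hex rendering, two digits per byte (hypothetical helper).
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Compute (min, max) over the non-null values, comparing byte slices
/// lexicographically. Returns None if the column has no non-null values.
fn min_max_bytes(column: &[Option<Vec<u8>>]) -> Option<(&[u8], &[u8])> {
    // Skip nulls and borrow each remaining value as a byte slice.
    let mut values = column.iter().flatten().map(|v| v.as_slice());
    let first = values.next()?;
    Some(values.fold((first, first), |(lo, hi), v| {
        (if v < lo { v } else { lo }, if v > hi { v } else { hi })
    }))
}

/// Produce the text for the min and max cells: hex strings when a value
/// exists, or "null" when the column has nothing to aggregate.
fn describe_min_max(column: &[Option<Vec<u8>>]) -> (String, String) {
    match min_max_bytes(column) {
        Some((lo, hi)) => (to_hex(lo), to_hex(hi)),
        None => ("null".to_string(), "null".to_string()),
    }
}
```

The point of the sketch is the ordering of the steps: the aggregation runs on the raw bytes, and the conversion to text happens only at the end, so no Utf8 cast is ever applied to binary data.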
### Describe alternatives you've considered
- Keep returning null. Safe, but unhelpful — users have to write their own
SQL to get this information.
- Render binary as base64 instead of hex. Slightly more compact for long
values but less common for debugging/inspection. Hex matches Arrow's existing
display formatters, so it's the lower-friction choice.
- Skip binary columns from describe entirely (don't even show
count/null_count). Worse than today — you lose information that already works
correctly.
- Add a format option to describe() to let the caller choose. Possible, but
premature — pick a sensible default first.
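
To put a number on the "slightly more compact" point in the base64 alternative, here is a small sketch of the encoded lengths. It assumes standard padded base64 (4 output characters per 3 input bytes); the function names are hypothetical.

```rust
/// Characters needed to hex-encode `n` bytes (two digits per byte).
fn hex_len(n: usize) -> usize {
    n * 2
}

/// Characters needed to encode `n` bytes as standard padded base64:
/// every group of up to 3 bytes becomes 4 characters.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}
```

For a FixedSizeBinary(16) value this gives 32 hex characters versus 24 base64 characters, and for FixedSizeBinary(32) it gives 64 versus 44, so the saving is real but modest at these sizes.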
### Additional context
Generated by Codex