kumarUjjawal opened a new issue, #21496:
URL: https://github.com/apache/datafusion/issues/21496

   ### Is your feature request related to a problem or challenge?
   
   After PR #21455, DataFrame::describe() no longer crashes on binary-like columns 
(Binary, LargeBinary, BinaryView, FixedSizeBinary), but it now returns null for 
min and max on those columns.
   
   That fix avoids an unsafe Cast(Binary → Utf8), but it leaves users with no 
way to see the value range of a binary column in describe().
   
   For columns that store hashes, UUIDs, content-addressed identifiers, or 
fingerprints (common uses of FixedSizeBinary(16) / FixedSizeBinary(32)), knowing 
the min/max value is genuinely useful for sanity-checking data.
   
   The Utf8 cast is the wrong tool for this. Arrow correctly refuses to cast 
arbitrary bytes to Utf8 because there is no general lossless mapping. But that 
doesn't mean we have to give up on min/max for binary; we can render the bytes 
as hex instead.
   
   ### Describe the solution you'd like
   
   In DataFrame::describe():
   
     1. Stop filtering Binary, LargeBinary, BinaryView out of the min/max 
aggregations. These types are already supported by MinMaxBytesAccumulator, so 
min(col) and max(col) produce a real binary scalar.
     2. At the display step, special-case binary columns: instead of 
cast(column, &DataType::Utf8), use Arrow's ArrayFormatter (or DisplayIndex), 
which already renders these arrays as lowercase hex (it writes each byte via 
{byte:02x}).
     3. Update the describe schema so binary columns map to Utf8 output 
containing the hex string (they already do — only the projection back into the 
output column needs to change).
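   
   A minimal sketch of steps 1 and 2, using only the Rust standard library 
rather than the DataFusion or Arrow APIs. The names to_hex and describe_min_max 
are hypothetical, and the sketch assumes binary min/max compares values 
byte-wise (lexicographically). It only illustrates the {byte:02x} rendering 
mentioned above, not the actual describe() code path.

```rust
use std::fmt::Write;

/// Render a byte slice as lowercase hex, two digits per byte.
fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for byte in bytes {
        // Writing into a String cannot fail, so unwrap is safe here.
        write!(out, "{byte:02x}").unwrap();
    }
    out
}

/// Hypothetical helper: min and max of the non-null binary values,
/// each rendered as a hex string. Returns None when every value is null.
/// Assumption: values are ordered byte-wise, which is how `&[u8]` compares.
fn describe_min_max(values: &[Option<&[u8]>]) -> (Option<String>, Option<String>) {
    // Iterate over non-null values only; nulls do not take part in min/max.
    let non_null = || values.iter().flatten().copied();
    let min = non_null().min().map(to_hex);
    let max = non_null().max().map(to_hex);
    (min, max)
}

fn main() {
    // Stand-in for a nullable binary column with three rows.
    let column: [Option<&[u8]>; 3] = [Some(&[0x01, 0x02]), None, Some(&[0x00, 0xff])];
    let (min, max) = describe_min_max(&column);
    println!("min = {min:?}, max = {max:?}");
}
```

   Unlike a Utf8 cast, the hex rendering is defined for every byte sequence, so 
the display step cannot fail on non-UTF-8 data.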
   
   
   
   ### Describe alternatives you've considered
   
   - Keep returning null. Safe, but unhelpful — users have to write their own 
SQL to get this information.
   - Render binary as base64 instead of hex. Slightly more compact for long 
values but less common for debugging/inspection. Hex matches Arrow's existing 
display formatters, so it's the lower-friction choice.
   - Skip binary columns from describe entirely (don't even show 
count/null_count). Worse than today — you lose information that already works 
correctly.
   - Add a format option to describe() to let the caller choose. Possible, but 
premature — pick a sensible default first.
   
   
   ### Additional context
   
   Generated by Codex


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

