adriangb opened a new issue, #22225:
URL: https://github.com/apache/datafusion/issues/22225

   ## Is your feature request related to a problem or challenge?
   
   `arrays_zip` (`arrays_zip_inner_with_field` in 
`datafusion/functions-nested/src/arrays_zip.rs`) always assembles its output by 
walking every row through per-column `MutableArrayData` builders, copying each 
input slice one row at a time (`builder.extend(0, start, end)`) and padding 
shorter rows with NULLs (`builder.extend_nulls(...)`).
   
   When the inputs form a **perfect zip**, this row-by-row copy is wasted work. 
A perfect zip means every input array has identical per-row element lengths and 
no null list row spans a non-zero number of element slots, so no null padding 
is needed. In that case the resulting struct child columns are bit-identical to 
the (concatenated) input value arrays, and the list offsets are identical to 
the inputs' offsets.
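
   To make the condition concrete, here is a minimal sketch of the check using 
plain slices rather than Arrow types. `ListInput` and `is_perfect_zip` are 
hypothetical names for illustration only and are not part of arrow-rs or 
DataFusion. The sketch assumes Arrow-style offsets (`offsets[i]..offsets[i + 1]` 
is row `i`) and a validity mask with one entry per row.

```rust
/// Simplified stand-in for one list-array input (hypothetical type, not the
/// arrow-rs API): Arrow-style offsets plus an optional per-row validity mask.
struct ListInput<'a> {
    /// `offsets[i]..offsets[i + 1]` is the value range of row `i`.
    offsets: &'a [i32],
    /// `None` means every row is valid; `false` at index `i` marks row `i` null.
    validity: Option<&'a [bool]>,
}

impl ListInput<'_> {
    fn row_len(&self, row: usize) -> i32 {
        self.offsets[row + 1] - self.offsets[row]
    }

    fn is_null(&self, row: usize) -> bool {
        self.validity.map_or(false, |v| !v[row])
    }
}

/// Returns true when all inputs share the same per-row lengths and no null
/// row occupies value slots, i.e. no null padding would be needed.
fn is_perfect_zip(inputs: &[ListInput<'_>]) -> bool {
    let Some((first, rest)) = inputs.split_first() else {
        return false; // nothing to zip
    };
    if first.offsets.is_empty() {
        return false; // a valid offsets buffer has at least one entry
    }
    // Every input must describe the same number of rows.
    if rest.iter().any(|inp| inp.offsets.len() != first.offsets.len()) {
        return false;
    }
    let num_rows = first.offsets.len() - 1;
    for row in 0..num_rows {
        let len = first.row_len(row);
        for input in inputs {
            // A ragged row would require null padding.
            if input.row_len(row) != len {
                return false;
            }
            // A null row that still spans value slots cannot be reused as-is.
            if input.is_null(row) && len != 0 {
                return false;
            }
        }
    }
    true
}
```

   The check is linear in the number of rows and touches only offsets and 
validity, never the values themselves, so it should be cheap compared with the 
copy it is meant to avoid.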
   
   ## Describe the solution you'd like
   
   Detect the perfect-zip case up front and skip the `MutableArrayData` path 
entirely:
   
   - Build the output struct child columns directly from the original input 
value `ArrayRef`s (clone / concat, no per-row copy).
   - Reuse an input array's offset buffer for the resulting `ListArray` instead 
of rebuilding it.
   
   This keeps the existing general path as a fallback for the ragged / 
null-padded cases.
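
   As a rough illustration of why the fast path is cheap, the sketch below 
models Arrow's reference-counted buffers with `Arc<[i32]>`. `ListColumn`, 
`ZippedList`, and `zip_perfect` are hypothetical names, not existing DataFusion 
or arrow-rs items. It assumes the caller has already established the 
perfect-zip condition, and that offsets start at 0 over unsliced value arrays; 
sliced inputs would need their values sliced or concatenated first.

```rust
use std::sync::Arc;

/// Stand-in for a reference-counted Arrow buffer or `ArrayRef`.
type Shared = Arc<[i32]>;

/// One list-typed input column: offsets plus flattened values.
struct ListColumn {
    offsets: Shared,
    values: Shared,
}

/// Simplified zipped output: one offsets buffer and one child per input.
struct ZippedList {
    offsets: Shared,
    fields: Vec<Shared>,
}

/// Fast path for a perfect zip: no per-row copy, only reference-count bumps.
/// Returns `None` when there are no inputs.
fn zip_perfect(inputs: &[ListColumn]) -> Option<ZippedList> {
    let first = inputs.first()?;
    Some(ZippedList {
        // Reuse one input's offsets instead of rebuilding them.
        offsets: Arc::clone(&first.offsets),
        // Each struct child is a clone of the original value buffer.
        fields: inputs.iter().map(|c| Arc::clone(&c.values)).collect(),
    })
}
```

   In this model the output shares memory with the inputs, so the cost is 
independent of the number of rows; the general `MutableArrayData` path would 
still be needed whenever the perfect-zip condition does not hold.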
   
   ## Describe alternatives you've considered
   
   Keep the current always-copy implementation. It is correct but does 
avoidable work for the common case where all zipped arrays line up.
   
   ## Additional context
   
   Raised by @paleolimbot while reviewing #21984:
   
   > Not here, but for the perfect zip (all value arrays the same length, no 
nulls with non-zero element slot lengths, no null padding needed) this should 
ideally be just clones of the original arrayrefs
   
   Split out of #21984 (a metadata-propagation bugfix) since this is an 
orthogonal performance optimization that warrants its own benchmarks and 
edge-case tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

