andygrove opened a new issue, #3836:
URL: https://github.com/apache/datafusion-comet/issues/3836

   ## Description
   
   Currently, when writing shuffle data with many output partitions, each input 
batch gets split into many small per-partition batches (e.g. 8192 rows across 
200 partitions ≈ 41 rows per partition). Before serialization, we use Arrow's 
`BatchCoalescer` to combine these small batches into larger ones, reducing 
per-batch IPC overhead.
   
   However, `BatchCoalescer` allocates new arrays and copies row data to 
produce the coalesced batch. This is unnecessary work — the data is about to be 
serialized anyway.
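   To make the redundant copy concrete, here is a rough Python sketch of the 
shape of the current path (the names and the length-prefix "serialization" are 
hypothetical stand-ins, not Comet's actual code): the coalesce step makes one 
full pass over the data just to produce a contiguous buffer that is immediately 
consumed by the serializer.

   ```python
   def serialize(buf: bytes) -> bytes:
       # Stand-in for IPC encoding: prepend a length header and copy
       # the payload into the output frame.
       return len(buf).to_bytes(4, "little") + buf

   def coalesce_then_serialize(small_buffers: list[bytes]) -> bytes:
       # BatchCoalescer-style step: allocate a new buffer and copy every
       # small per-partition buffer into it (one full pass over the data).
       coalesced = b"".join(small_buffers)
       # The serializer then copies the same bytes a second time.
       return serialize(coalesced)
   ```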
   
   ## Proposed Optimization
   
   Explore writing multiple small batches as a single IPC record batch message 
by concatenating their Arrow buffers directly in the IPC writer, avoiding the 
intermediate copy. This would be a "coalesce-on-write" approach:
   
   - Instead of: small batches → `BatchCoalescer` (alloc + copy) → large batch 
→ IPC serialize
   - Do: small batches → IPC serialize directly as one message (concatenate 
buffers)
   
   This would eliminate one full data copy per batch in the shuffle write hot 
path.
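   A sketch of what coalesce-on-write could look like, again with hypothetical 
names: the writer emits a header describing the combined batch, then streams 
each small buffer straight into the output with a single copy. One known 
subtlety is that variable-length columns (e.g. `Utf8`/`Binary`) store offsets 
relative to their own data buffer, so concatenating their offset buffers 
requires rebasing each subsequent buffer by the running data length:

   ```python
   def serialize_directly(small_buffers: list[bytes]) -> bytes:
       # Header describes the total length of the logical buffer, as if
       # it had already been coalesced.
       total = sum(len(b) for b in small_buffers)
       out = bytearray(total.to_bytes(4, "little"))
       for b in small_buffers:
           out += b  # single copy: straight into the output frame
       return bytes(out)

   def rebase_offsets(offset_buffers: list[list[int]]) -> list[int]:
       # Arrow variable-length arrays store offsets into their own data
       # buffer, starting at 0. When concatenating arrays, each later
       # offsets buffer must be shifted by the length of the data that
       # precedes it.
       out = [0]
       base = 0
       for offs in offset_buffers:
           for o in offs[1:]:  # skip the leading 0 of each buffer
               out.append(base + o)
           base = out[-1]
       return out
   ```

   Fixed-width buffers can simply be written back to back, so the offset 
rebasing is the main extra work the IPC writer would take on in exchange for 
dropping the `BatchCoalescer` copy.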
   
   ## Context
   
   This applies to both the block-based and IPC stream shuffle formats. The 
`BatchCoalescer` is used in `BufBatchWriter` (block format) and in the IPC 
stream multi-partition write path.
   
   Benchmark data (200 partitions, LZ4, 10M rows from TPC-H SF100 lineitem):
   - Block format: 2.64M rows/s, 609 MiB output
   - IPC stream format: 2.60M rows/s, 634 MiB output
   
   Both paths use `BatchCoalescer` and would benefit from this optimization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to