andygrove opened a new issue, #3836: URL: https://github.com/apache/datafusion-comet/issues/3836
## Description

Currently, when writing shuffle data with many output partitions, each input batch gets split into many small per-partition batches (e.g. 8192 rows across 200 partitions ≈ ~41 rows per partition). Before serialization, we use Arrow's `BatchCoalescer` to combine these small batches into larger ones to reduce per-batch IPC overhead.

However, `BatchCoalescer` allocates new arrays and copies row data to produce the coalesced batch. This is unnecessary work: the data is about to be serialized anyway.

## Proposed Optimization

Explore writing multiple small batches as a single IPC record batch message by concatenating their Arrow buffers directly in the IPC writer, avoiding the intermediate copy. This would be a "coalesce-on-write" approach:

- Instead of: small batches → `BatchCoalescer` (alloc + copy) → large batch → IPC serialize
- Do: small batches → IPC serialize directly as one message (concatenate buffers)

This would eliminate one full data copy per batch in the shuffle write hot path.

## Context

This applies to both the block-based and IPC stream shuffle formats. The `BatchCoalescer` is used in `BufBatchWriter` (block format) and in the IPC stream multi-partition write path.

Benchmark data (200 partitions, LZ4, 10M rows from TPC-H SF100 lineitem):

- Block format: 2.64M rows/s, 609 MiB output
- IPC stream format: 2.60M rows/s, 634 MiB output

Both paths use `BatchCoalescer` and would benefit from this optimization.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
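To make the copy-elimination concrete, here is a minimal stdlib-only Rust sketch. It is not Comet or arrow-rs code; plain byte vectors stand in for Arrow array buffers, and a length-prefixed output stands in for an IPC message. The point is only that concatenating source buffers straight into the serialized output produces the same bytes as coalescing first, with one fewer full copy of the data.

```rust
// Hypothetical sketch: Vec<u8> stands in for an Arrow buffer, and a
// u32 length prefix stands in for an IPC record batch message header.

/// Current path: coalesce first (alloc + copy), then serialize (second copy).
fn coalesce_then_serialize(batches: &[Vec<u8>]) -> Vec<u8> {
    // BatchCoalescer-style step: allocate one large buffer and copy rows in.
    let total: usize = batches.iter().map(Vec::len).sum();
    let mut coalesced = Vec::with_capacity(total);
    for b in batches {
        coalesced.extend_from_slice(b);
    }
    // IPC-style serialization: copy the coalesced buffer into the output.
    let mut out = Vec::with_capacity(4 + total);
    out.extend_from_slice(&(coalesced.len() as u32).to_le_bytes());
    out.extend_from_slice(&coalesced);
    out
}

/// Proposed path: serialize the small buffers directly as one message,
/// skipping the intermediate coalesced allocation and copy.
fn serialize_concatenated(batches: &[Vec<u8>]) -> Vec<u8> {
    let total: usize = batches.iter().map(Vec::len).sum();
    let mut out = Vec::with_capacity(4 + total);
    out.extend_from_slice(&(total as u32).to_le_bytes()); // same header
    for b in batches {
        // Single copy per buffer, straight into the serialized output.
        out.extend_from_slice(b);
    }
    out
}

fn main() {
    let batches = vec![vec![1u8, 2, 3], vec![4, 5], vec![6, 7, 8, 9]];
    let two_copy = coalesce_then_serialize(&batches);
    let one_copy = serialize_concatenated(&batches);
    // Identical wire bytes, one fewer pass over the row data.
    assert_eq!(two_copy, one_copy);
    println!("{} bytes", one_copy.len());
}
```

In the real IPC writer the work is more involved than this sketch suggests: each Arrow column has multiple buffers (validity, offsets, values), and variable-length types need their offset buffers rebased when buffers are concatenated, so "concatenate directly" still requires per-type handling.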
