[PR] feat: use byte-based target batch size for shuffle IPC blocks [datafusion-comet]

via GitHub Wed, 08 Apr 2026 13:54:19 -0700


andygrove opened a new pull request, #3913:
URL: https://github.com/apache/datafusion-comet/pull/3913


   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   The native shuffle writer currently uses a row-based target batch size of 
8192 rows for coalescing small batches before writing IPC blocks. For narrow 
schemas (few columns, small data types), this produces tiny blocks with 
disproportionate per-block IPC schema overhead.
   
   A byte-based threshold ensures reasonably sized blocks regardless of schema 
width, improving shuffle write efficiency for narrow schemas without negatively 
impacting wide schemas.
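
As a rough illustration of the size gap described above, the following sketch estimates block sizes for fixed-width columns. The helper names are hypothetical and not part of Comet, and the estimate deliberately ignores validity bitmaps, offset buffers, and IPC framing:

```rust
/// Approximate in-memory size of a batch made only of fixed-width columns.
/// Simplifying assumption: no validity bitmaps, offsets, or IPC framing.
pub fn approx_batch_bytes(rows: usize, column_widths: &[usize]) -> usize {
    rows * column_widths.iter().sum::<usize>()
}

/// Number of rows needed before a batch of this schema reaches `target_bytes`.
pub fn rows_to_reach(target_bytes: usize, column_widths: &[usize]) -> usize {
    // Treat a zero-width row as one byte so the division is always defined.
    let row_bytes = column_widths.iter().sum::<usize>().max(1);
    // Ceiling division.
    (target_bytes + row_bytes - 1) / row_bytes
}
```

Under these assumptions, a single 4-byte column at 8192 rows comes to 32 KiB, and reaching a 1 MiB target would take 262,144 rows. Fifty 8-byte columns at 8192 rows already come to roughly 3.1 MiB.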
   
   ## What changes are included in this PR?
   
   - Replace Arrow's `BatchCoalescer` (row-based) with byte-based accumulation 
in `BufBatchWriter`: batches are buffered until their total memory size 
reaches the target, then concatenated and written as a single IPC block
   - Switch `SinglePartitionShufflePartitioner` buffering from row-count to 
byte-size threshold
   - Add `target_batch_bytes` parameter to 
`MultiPartitionShuffleRepartitioner`, while keeping the row-based `batch_size` 
for scratch space sizing and input batch slicing (which is about processing 
chunk limits, not output block size)
   - Add `COMET_SHUFFLE_TARGET_BATCH_BYTES` config 
(`spark.comet.exec.shuffle.targetBatchBytes`, default 1 MiB)
   - Add `target_batch_bytes` field to `ShuffleWriter` protobuf message
   - Add `--target-batch-bytes` CLI arg to the standalone shuffle benchmark tool
   - Fix bug: `COMET_SHUFFLE_WRITE_BUFFER_SIZE` used `.max(Int.MaxValue)` 
instead of `.min(Int.MaxValue)` when converting Long to Int for protobuf, so 
the value sent was always `Int.MaxValue` bytes (about 2 GiB) regardless of the 
configured value
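
The byte-based accumulation and the clamp fix listed above can be sketched as follows. This is a minimal, dependency-free illustration rather than the code in this PR: `Batch` stands in for an Arrow `RecordBatch`, and `ByteCoalescer`, `clamp_to_i32`, and `buggy_clamp_to_i32` are hypothetical names. The clamp functions are a Rust analogue of the Scala `.min` / `.max` expressions.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch: just raw bytes.
pub struct Batch {
    pub data: Vec<u8>,
}

impl Batch {
    /// Size used for the threshold check (assumed to be the buffer length here).
    pub fn memory_size(&self) -> usize {
        self.data.len()
    }
}

/// Buffers batches until their combined size reaches `target_batch_bytes`.
pub struct ByteCoalescer {
    target_batch_bytes: usize,
    buffered: Vec<Batch>,
    buffered_bytes: usize,
}

impl ByteCoalescer {
    pub fn new(target_batch_bytes: usize) -> Self {
        Self { target_batch_bytes, buffered: Vec::new(), buffered_bytes: 0 }
    }

    /// Buffer one batch; returns a concatenated block once the target is reached.
    pub fn push(&mut self, batch: Batch) -> Option<Batch> {
        self.buffered_bytes += batch.memory_size();
        self.buffered.push(batch);
        if self.buffered_bytes >= self.target_batch_bytes {
            self.flush()
        } else {
            None
        }
    }

    /// Concatenate whatever is buffered into a single block, if anything.
    pub fn flush(&mut self) -> Option<Batch> {
        if self.buffered.is_empty() {
            return None;
        }
        let mut data = Vec::with_capacity(self.buffered_bytes);
        for batch in self.buffered.drain(..) {
            data.extend_from_slice(&batch.data);
        }
        self.buffered_bytes = 0;
        Some(Batch { data })
    }
}

/// Intended behaviour: cap a 64-bit size at i32::MAX before narrowing.
pub fn clamp_to_i32(value: i64) -> i32 {
    value.min(i32::MAX as i64) as i32
}

/// The buggy pattern: `max` raises every in-range value to i32::MAX.
pub fn buggy_clamp_to_i32(value: i64) -> i32 {
    value.max(i32::MAX as i64) as i32
}
```

In this sketch a caller would invoke `push` for each incoming batch, write any returned block, and call `flush` once at the end of the partition to emit the remainder. With a 1 MiB configured value, `clamp_to_i32` returns 1 MiB while `buggy_clamp_to_i32` returns `i32::MAX`.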
   
   ## How are these changes tested?
   
   Existing shuffle tests (19 tests) all pass. The 
`test_batch_coalescing_reduces_size` test validates that byte-based coalescing 
still produces smaller output than no coalescing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

