andygrove opened a new pull request, #3911:
URL: https://github.com/apache/datafusion-comet/pull/3911

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   The native columnar shuffle currently uses a custom per-block format 
(length-prefixed compressed IPC messages) that requires manual framing, 
per-block schema headers, and a custom Java reader 
(`CometShuffleBlockIterator`). Replacing this with standard Arrow IPC streams 
eliminates custom serialization code, enables built-in IPC body compression 
(zstd/lz4), and allows the shuffle reader to use Arrow's `StreamReader` 
directly.
   
   ## What changes are included in this PR?
   
   **Write path:**
   - Replace `ShuffleBlockWriter` with Arrow IPC `StreamWriter` in all 
partitioners (single, multi, empty-schema)
   - Each partition's data in the shuffle file is now a complete IPC stream 
(schema + batches + EOS marker)
   - Small batches are coalesced via `BufBatchWriter` before serialization
   - Enable `ipc_compression` feature in the Arrow dependency for built-in 
zstd/lz4 body compression
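
To illustrate the coalescing step, here is a minimal, std-only sketch of buffering small batches until a row threshold is reached and then emitting one combined batch. It is not the PR's `BufBatchWriter`: `Batch`, `CoalescingBuffer`, `coalesce_all`, and the `target_rows` parameter are hypothetical names, and a plain `Vec<i64>` stands in for an Arrow record batch.

```rust
/// Hypothetical stand-in for a record batch: a single column of row values.
pub type Batch = Vec<i64>;

/// Buffers small batches and emits one combined batch once at least
/// `target_rows` rows have accumulated. Illustrative only; this is not
/// the actual `BufBatchWriter` from the PR.
pub struct CoalescingBuffer {
    target_rows: usize,
    pending: Vec<Batch>,
    pending_rows: usize,
}

impl CoalescingBuffer {
    pub fn new(target_rows: usize) -> Self {
        Self { target_rows, pending: Vec::new(), pending_rows: 0 }
    }

    /// Adds a batch; returns a coalesced batch when the threshold is reached.
    pub fn push(&mut self, batch: Batch) -> Option<Batch> {
        self.pending_rows += batch.len();
        self.pending.push(batch);
        if self.pending_rows >= self.target_rows {
            self.flush()
        } else {
            None
        }
    }

    /// Emits whatever is buffered (for end of input), or `None` if no rows are pending.
    pub fn flush(&mut self) -> Option<Batch> {
        if self.pending_rows == 0 {
            self.pending.clear();
            return None;
        }
        let mut out = Vec::with_capacity(self.pending_rows);
        for b in self.pending.drain(..) {
            out.extend(b);
        }
        self.pending_rows = 0;
        Some(out)
    }
}

/// Runs a sequence of batches through the buffer and collects the outputs.
pub fn coalesce_all(batches: Vec<Batch>, target_rows: usize) -> Vec<Batch> {
    let mut buf = CoalescingBuffer::new(target_rows);
    let mut out = Vec::new();
    for b in batches {
        if let Some(big) = buf.push(b) {
            out.push(big);
        }
    }
    if let Some(rest) = buf.flush() {
        out.push(rest);
    }
    out
}
```

In the real write path the coalesced batch would be handed to the IPC `StreamWriter`. The point of coalescing first is that per-batch IPC message overhead is paid once per larger batch instead of once per small one.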
   
   **Read path (native):**
   - Add `JniInputStream`: a Rust `Read` impl that pulls bytes from a JVM 
`InputStream` via JNI with 64KB read-ahead buffering
   - Add `ShuffleStreamReader`: manages reading potentially concatenated IPC 
streams (from spills) using Arrow's `StreamReader`
   - Update `ShuffleScanExec` to lazily create a `ShuffleStreamReader` instead 
of calling per-block decode methods
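
The effect of the 64KB read-ahead buffer can be sketched with the standard library alone. This is not the PR's `JniInputStream`: `MockJvmStream` is a hypothetical stand-in whose `read` models one JNI round trip per call, and `std::io::BufReader` plays the role of the read-ahead buffer. `READ_AHEAD` and `read_in_small_chunks` are also illustrative names.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical stand-in for a JVM `InputStream` reached over JNI.
/// Each call to `read` models one (expensive) JNI round trip.
pub struct MockJvmStream {
    data: Vec<u8>,
    pos: usize,
    pub calls: usize,
}

impl MockJvmStream {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data, pos: 0, calls: 0 }
    }
}

impl Read for MockJvmStream {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        // Copy as many bytes as fit; returning 0 signals end of stream.
        let n = buf.len().min(self.data.len() - self.pos);
        buf[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

/// Read-ahead size taken from the PR description (64KB).
pub const READ_AHEAD: usize = 64 * 1024;

/// Reads all of `data` in `chunk`-sized pieces through a read-ahead buffer.
/// Returns (bytes read, number of calls that reached the underlying stream).
pub fn read_in_small_chunks(data: Vec<u8>, chunk: usize) -> io::Result<(usize, usize)> {
    let mut reader = BufReader::with_capacity(READ_AHEAD, MockJvmStream::new(data));
    let mut scratch = vec![0u8; chunk];
    let mut total = 0;
    loop {
        let n = reader.read(&mut scratch)?;
        if n == 0 {
            break;
        }
        total += n;
    }
    Ok((total, reader.get_ref().calls))
}
```

Reading 128KB in 8-byte pieces this way reaches the underlying stream only a handful of times rather than once per small read. That matters here because an IPC stream reader issues many small reads (message lengths, metadata) and each unbuffered one would otherwise cross the JNI boundary.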
   
   **Read path (JVM):**
   - Replace `CometShuffleBlockIterator` (custom Java reader) with a simple 
`InputStream` passed to native via JNI
   - Simplify `NativeBatchDecoderIterator` to open a stream, read batches, and 
close — no more per-block ByteBuffer management
   - Add `openShuffleStream`, `nextShuffleStreamBatch`, 
`shuffleStreamNumFields`, `closeShuffleStream` JNI methods
   
   **Cleanup:**
   - Remove `ShuffleBlockWriter`, `CometShuffleBlockIterator`, and shuffle 
block iterator JNI bridge code
   - Remove legacy configuration options 
(`COMET_COLUMNAR_SHUFFLE_ASYNC_ENABLED`, 
`COMET_SHUFFLE_PREFER_DICTIONARY_RATIO`)
   
   ## How are these changes tested?
   
   - Existing shuffle Rust unit tests updated and passing (19 tests)
   - Existing JVM shuffle integration tests provide end-to-end coverage
   - Clippy clean with `-D warnings`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

