karuppayya commented on issue #3882: URL: https://github.com/apache/datafusion-comet/issues/3882#issuecomment-4184628771
Spark mappers cannot compress at file granularity, since each reducer needs to fetch its own shuffle blocks. (Comet also appears to build its IPC blocks per shuffle partition.) [Spark](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala#L158-L160) compresses at the shuffle-block level, and for LZ4 it uses `spark.io.compression.lz4.blockSize` (default 32K) as the LZ4 block size, whereas in Comet the batch size (IPC blocks) is expressed as a row count, which I guess is by design of the Arrow format.

> then use CometScan to hand out slices at batch_size number of rows.

Is this because the read is the memory-intensive operation?
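To make the granularity difference concrete, here is a minimal sketch of the Spark side: one shuffle block compressed as a stream of fixed-size compression frames (byte granularity), the way an LZ4 block stream does. This is an illustration, not Spark's actual code: `zlib` stands in for LZ4 so the sketch has no dependencies, and `compress_shuffle_block` is a hypothetical helper name. Spark itself wraps the block's output stream in lz4-java's `LZ4BlockOutputStream` with `spark.io.compression.lz4.blockSize` as the frame size.

```python
import zlib

# Stand-in for spark.io.compression.lz4.blockSize (default 32K).
LZ4_BLOCK_SIZE = 32 * 1024


def compress_shuffle_block(data: bytes, block_size: int = LZ4_BLOCK_SIZE) -> list:
    """Compress one shuffle block as a sequence of fixed-size frames.

    zlib stands in for LZ4 here; the point is only that the framing is
    byte-based (block_size), independent of how many rows the data holds,
    unlike Comet's row-count-sized Arrow IPC batches.
    """
    return [
        zlib.compress(data[i : i + block_size])
        for i in range(0, len(data), block_size)
    ]


# A 100 KiB shuffle block splits into ceil(100K / 32K) = 4 compression frames,
# regardless of row boundaries inside the block.
frames = compress_shuffle_block(b"x" * 100 * 1024)
```

The contrast with Comet/Arrow is that an IPC batch is cut every `batch_size` rows, so its compressed unit size depends on the schema and row width rather than a fixed byte budget.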
