karuppayya commented on issue #3882: URL: https://github.com/apache/datafusion-comet/issues/3882#issuecomment-4184628771
Spark mappers cannot compress at file granularity, since each reducer needs to fetch its own shuffle blocks. (Comet also appears to build its IPC blocks per shuffle partition.) [Spark](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala#L158-L160) compresses at the shuffle-block level, and for LZ4 it uses `spark.io.compression.lz4.blockSize` (default 32K) as the LZ4 block size, whereas in Comet the batch size (IPC blocks) is expressed as a row count, which I guess is by design of the Arrow format.

> then use CometScan to hand out slices at batch_size number of rows.

Is this because the read is the memory-intensive operation?
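To make the granularity difference concrete, here is a minimal sketch of the Spark side: one shuffle block compressed as a stream of fixed-size compression frames (byte granularity), the way an LZ4 block stream does. This is an illustration, not Spark's actual code: `zlib` stands in for LZ4 so the sketch has no dependencies, and `compress_shuffle_block` is a hypothetical helper name. Spark itself wraps the block's output stream in lz4-java's `LZ4BlockOutputStream` with `spark.io.compression.lz4.blockSize` as the frame size.

```python
import zlib

# Stand-in for spark.io.compression.lz4.blockSize (default 32K).
LZ4_BLOCK_SIZE = 32 * 1024


def compress_shuffle_block(data: bytes, block_size: int = LZ4_BLOCK_SIZE) -> list:
    """Compress one shuffle block as a sequence of fixed-size frames.

    zlib stands in for LZ4 here; the point is only that the framing is
    byte-based (block_size), independent of how many rows the data holds,
    unlike Comet's row-count-sized Arrow IPC batches.
    """
    return [
        zlib.compress(data[i : i + block_size])
        for i in range(0, len(data), block_size)
    ]


# A 100 KiB shuffle block splits into ceil(100K / 32K) = 4 compression frames,
# regardless of row boundaries inside the block.
frames = compress_shuffle_block(b"x" * 100 * 1024)
```

The contrast with Comet/Arrow is that an IPC batch is cut every `batch_size` rows, so its compressed unit size depends on the schema and row width rather than a fixed byte budget.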
