Re: [PR] feat: Add immediate mode option for native shuffle [datafusion-comet]

via GitHub Thu, 09 Apr 2026 17:06:20 -0700


parthchandra commented on PR #3845:
URL: 
https://github.com/apache/datafusion-comet/pull/3845#issuecomment-4218700657


   > > If I understand it correctly, this is quite similar to what we did 
before switching to `interleave_batch`-based repartitioning.
   > > One concern is the memory bloat problem when the number of 
partitions is large. It is quite common to repartition the data into 1000s of 
partitions when working with large datasets in Spark. The memory reserved by 
the builder would consume lots of memory even before we start to append data 
into them. That was the motivation for me to implement the 
`interleave_batch`-based approach. See also #887 for details.
   > > The immediate mode works better for a small number of partitions than the 
`interleave_batch`-based approach, so I think it is a great feature to have. I 
would suggest that we invest in implementing sort-based repartitioning to scale 
better to large numbers of partitions and to get better performance.
   > 
   > Thanks. This is good feedback. I will update the tuning guide to explain 
these trade-offs between the two approaches.
   
   Can we configure a block size just for this shuffle to ease the memory 
pressure with a large number of partitions? Perhaps a benchmark with different 
numbers of partitions might improve our intuition? 
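
A rough back-of-the-envelope sketch of why a smaller block size could help. It assumes each partition holds one pre-sized builder per column and that "block size" means rows reserved per builder. The function name, the 20-column schema, and the 8-byte value width are hypothetical and are not taken from Comet's implementation, so the numbers only show how the reservation scales, not what the shuffle actually allocates.

```python
def reserved_builder_bytes(num_partitions, num_columns, rows_per_builder, bytes_per_value):
    """Estimate memory reserved up front, before any row is appended.

    Assumes (hypothetically) one pre-sized, fixed-width builder per
    column per partition; real builders also carry validity bitmaps
    and offsets, which are ignored here.
    """
    return num_partitions * num_columns * rows_per_builder * bytes_per_value


# Hypothetical workload: 20 fixed-width columns of 8 bytes each.
for partitions in (16, 200, 2000):
    for rows in (8192, 1024):
        mib = reserved_builder_bytes(partitions, 20, rows, 8) / (1024 * 1024)
        print(f"{partitions:>5} partitions, {rows:>5} rows/builder: {mib:8.1f} MiB reserved")
```

Under these assumptions the reservation grows linearly with the partition count, so shrinking the per-builder block size reduces the up-front cost by the same factor. A benchmark across partition counts would still be needed to see the real effect.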


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

