parthchandra commented on PR #3845: URL: https://github.com/apache/datafusion-comet/pull/3845#issuecomment-4218700657
> > If I understand it correctly, this is quite similar to what we did before switching to `interleave_batch`-based repartitioning. > > One concern is the memory bloat problem problem when the number of partitions is large. It is quite common to repartition the data into 1000s of partitions when working with large datasets in Spark. The memory reserved by the builder would consume lots of memory even before we start to append data into them. That was the motivation for me to implement the `interleave_batch`-based approach. See also #887 for details. > > The immediate mode works better for small number of partitions than `interleave_batch`-based approach, so I think it is a great to have feature. I would suggest that we invest in implementing sort-based repartitioning to scale better for large number of partitions and better performance. > > Thanks. This is good feedback. I will update the tuning guide to explain these trade offs between the two approaches. Can we configure a block size just for this shuffle to ease the memory pressure with a large number of partitions? Perhaps a benchmark with different numbers of partitions might improve our intuition? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
