alamb commented on issue #23174: URL: https://github.com/apache/datafusion/issues/23174#issuecomment-4798447805
> Summary: DataFusion mostly uses repartition-based parallelism today, but at some point we need to introduce intra-partition parallelism, and we have to do that carefully for performance.

Another way to describe this, which I prefer, is "partition-based parallelism": DataFusion tries to create a plan with `target_partitions` partitions, which then uses that many cores during execution. I think there are some counterexamples that don't live up to this exactly (e.g. wide fan-in Unions and SortPreservingMerge), but otherwise it largely holds and works well.

Part of the reason DataFusion is so fast for the classic scan-filter-aggregate query is this partition parallelism (see my TokioConf 2026 talk, "Using Tokio for CPU-Bound Tasks (Works Really Well)": [TokioConf 2026](https://www.tokioconf.com/), [slides](https://docs.google.com/presentation/d/1gnMDMuzVE-ESqFeBrQHVzspgFKgUsEDJ4cTgrX3hD0Q), and [recording](https://www.youtube.com/watch?v=FeqRdDG1Y7g)).
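
To make the partition-based model described above concrete, here is a minimal sketch using only the Rust standard library. It is not DataFusion code: `partitioned_even_sum` is a hypothetical function, the "query" (sum of the even values) is an illustrative stand-in for scan-filter-aggregate, and it assumes one OS thread per partition, whereas DataFusion schedules partitions as Tokio tasks.

```rust
use std::thread;

/// Hypothetical stand-in for a scan-filter-aggregate query, roughly
/// `SELECT SUM(v) FROM t WHERE v % 2 = 0`, run over in-memory data.
fn partitioned_even_sum(values: &[i64], target_partitions: usize) -> i64 {
    // Treat zero partitions as one, and keep the chunk size at least 1
    // so that `chunks` never panics on empty input.
    let partitions = target_partitions.max(1);
    let chunk_size = ((values.len() + partitions - 1) / partitions).max(1);

    thread::scope(|scope| {
        // Split the input into at most `partitions` chunks and spawn one
        // worker per chunk. Each worker scans, filters, and partially
        // aggregates only its own partition.
        let workers: Vec<_> = values
            .chunks(chunk_size)
            .map(|partition| {
                scope.spawn(move || {
                    partition.iter().filter(|v| **v % 2 == 0).sum::<i64>()
                })
            })
            .collect();

        // Final aggregation: merge the per-partition partial sums.
        workers
            .into_iter()
            .map(|worker| worker.join().expect("partition worker panicked"))
            .sum::<i64>()
    })
}
```

Calling `partitioned_even_sum(&data, 8)` uses up to eight workers, and the result does not depend on the partition count. The workers share nothing until the final merge, which is why this shape scales well with cores for scan-filter-aggregate plans.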
