alamb commented on issue #23174:
URL: https://github.com/apache/datafusion/issues/23174#issuecomment-4798447805

   > Summary: DataFusion mostly uses repartition-based parallelism today, but 
at some point we need to introduce intra-partition parallelism, and we have to 
do that carefully for performance.
   
   Another way to describe this, which I prefer, is "partition-based 
parallelism" -- basically, DataFusion tries to create a plan with 
`target_partitions` partitions, which then uses that many cores during 
execution.
   
   I think there are some counterexamples that don't live up to this exactly 
(e.g. wide fan-in Unions and SortPreservingMerge), but otherwise the model 
largely holds and works well. Part of the reason DataFusion is so fast for the 
classic scan-filter-aggregate query is this partition parallelism (see my talk 
"Using Tokio for CPU-Bound Tasks (Works Really Well)" at [TokioConf 
2026](https://www.tokioconf.com/): 
[slides](https://docs.google.com/presentation/d/1gnMDMuzVE-ESqFeBrQHVzspgFKgUsEDJ4cTgrX3hD0Q) 
and [recording](https://www.youtube.com/watch?v=FeqRdDG1Y7g)).
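
   To make the idea concrete, here is a minimal sketch of partition-based 
parallelism for a scan-filter-aggregate query, using only the Rust standard 
library. It is not DataFusion's implementation (DataFusion runs partitions as 
streams on Tokio, not one OS thread per chunk). The names `TARGET_PARTITIONS`, 
`partial_sum_of_evens`, and `parallel_sum_of_evens` are hypothetical and exist 
only for illustration.

```rust
use std::thread;

/// Hypothetical stand-in for DataFusion's `target_partitions` setting.
const TARGET_PARTITIONS: usize = 4;

/// Scan -> filter -> partial aggregate over a single partition.
fn partial_sum_of_evens(partition: &[i64]) -> i64 {
    partition.iter().filter(|v| **v % 2 == 0).sum()
}

/// Split `rows` into at most `target_partitions` chunks, aggregate each
/// chunk on its own thread, then combine the partial results.
fn parallel_sum_of_evens(rows: &[i64], target_partitions: usize) -> i64 {
    let parts = target_partitions.max(1);
    // Ceiling division, so we never produce more than `parts` chunks.
    let chunk_len = ((rows.len() + parts - 1) / parts).max(1);

    thread::scope(|s| {
        // Spawn every worker before joining any, so partitions run in parallel.
        let handles: Vec<_> = rows
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || partial_sum_of_evens(chunk)))
            .collect();

        // Final aggregate: merge the per-partition partial sums.
        handles
            .into_iter()
            .map(|h| h.join().expect("partition worker panicked"))
            .sum::<i64>()
    })
}

fn main() {
    let rows: Vec<i64> = (1..=1_000).collect();
    let total = parallel_sum_of_evens(&rows, TARGET_PARTITIONS);
    println!("sum of even values = {total}");
}
```

   Each partition does its scan, filter, and partial aggregation independently, 
and only the final merge touches more than one partition. That is why this 
shape scales with the number of partitions as long as the work is spread 
evenly across them.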
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

