Re: [I] Improve internal worker parallelism support [datafusion]

via GitHub Thu, 25 Jun 2026 20:07:40 -0700


2010YOUY01 commented on issue #23174:
URL: https://github.com/apache/datafusion/issues/23174#issuecomment-4806026533


   > I think the property that the partitioning / parallelism is explicit in 
the plan is a core one in DataFusion
   > 
   > Another way we could consider modeling the usecases above (e.g. a single 
partition in a WindowFunction) is to keep the multiple partitions, but 
internally use shared state, similar to how `RepartitionExec` or `JoinExec`, 
and `FileInputStream`. That way we keep the parallelism tied to the plan's 
structure (partition_count) but the streams executing each partition can 
dynamically adapt during plan time to better use resources
   > 
   > Concretely, maybe instead of
   > 
   > ```
   > (any downstream exec)
   > -- RepartitionExec(round-robin on batch, input_partitions=1, 
otuput_partitions=32)
   > ---- WindowExec(partition=1, internal_parallelism=32)
   > ------ CoalescePartitionExec(input_partitions=32, output_partitions=1)
   > -------- CsvExec(partition=32, internal_parallelism=1)
   > ```
   > 
   > We had something like
   > 
   > ```
   > (any downstream exec)
   > ---- WindowExec(partition=32)
   > -------- CsvExec(partition=32)
   > ```
   > 
   > And then **internally** within `WindowExec(partition=32, 
internal_parallelism=32)` it knew enough to coalesce the data into a single 
partition, and then split the work across the multiple partition streams (with 
a segment tree or whatever)
   > 
   > I don't think this is fundamentally different than a single exec with 
internal paralleism, but I think it keeps task/cpi model the same
   
   I think this simplified model is a good idea in single-node cases. My 
concern is that, for distributed use cases, shared-state parallelism usually 
isn't easily applicable, so we may want to explicitly separate partition 
parallelism from internal parallelism to make the adaptation easier.
   
   cc @gabotechs who has been working on `datafusion-distributed`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Improve internal worker parallelism support [datafusion]

Reply via email to