Re: [PR] Defer task spawning in RepartitionExec to first poll [datafusion]

via GitHub Fri, 03 Apr 2026 00:40:04 -0700


Dandandan commented on PR #21330:
URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182348918


   > It's a little slower but the spill to disk goes from 600MB to almost 
nothing, I guess there are scenarios that could trade some latency for no 
spilling.
   > 
   > Unless the new Morsel approach improves on both latency and spilling, I 
wonder if we shouldn't make this option available via a configuration option 
(disabled by default)? WDYT @Dandandan?
   > 
   > cc: @gene-bordegaray, FYI in case it might be relevant for the distributed 
DF scenario you are working on, which heavily uses RepartitionExec too
   
   I didn't see the disk write difference, but I wonder whether it is due to
something else (compilation?). @adriangb, how is this computed? It does seem to
reduce peak memory a bit.
   
   I agree that making it available as an option makes sense; consumers could
opt in to the more conservative behavior.
   
   My hope / feeling is that with morsel-based scanning this can actually be
removed (or we can convert the code to use a pipeline scheduler instead 🤔 and
limit the pipeline parallelism, or limit it based on resource/memory usage),
since driving a single pipeline at a time will be more cache-friendly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

