asolimando commented on PR #21330:
URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182359402

   > > It's a little slower but the spill to disk goes from 600MB to almost 
nothing, I guess there are scenarios that could trade some latency for no 
spilling.
   > > Unless the new Morsel approach improves on both latency and spilling, I 
wonder if we shouldn't make this option available via a configuration option 
(disabled by default)? WDYT @Dandandan?
   > > cc: @gene-bordegaray, FYI in case it might be relevant for the 
distributed DF scenario you are working on, which heavily uses RepartitionExec 
too
   > 
   > I didn't see the disk write difference, but I wonder if it is not due to 
something else (compiling?). @adriangb how is this computed? It does seem to 
reduce peak memory by a bit for TPC-DS though.
   > 
   > I agree having it available as an option does make sense, consumers could 
opt in to the more conservative way.
   > 
   > My hope / feeling with morsel-based scanning this actually can be removed 
(or we can convert the code to use some pipeline scheduler instead 🤔 and limit 
the pipeline parallelism (or limit it based on resource/memory usage) ), as 
driving a single pipeline at a time will be better in terms of 
cache-friendliness.
   
   Indeed, I looked at a few benchmark runs, and in most of them the main
branch spills more to disk (in only one run was it comparable), so the
difference might very well be an artifact.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

