asolimando commented on PR #21330: URL: https://github.com/apache/datafusion/pull/21330#issuecomment-4182359402
> > It's a little slower but the spill to disk goes from 600MB to almost nothing, I guess there are scenarios that could trade some latency for no spilling.
> >
> > Unless the new Morsel approach improves on both latency and spilling, I wonder if we shouldn't make this option available via a configuration option (disabled by default)? WDYT @Dandandan?
> >
> > cc: @gene-bordegaray, FYI in case it might be relevant for the distributed DF scenario you are working on, which heavily uses RepartitionExec too
> >
> > I didn't see the disk write difference, but I wonder if it is not due to something else (compiling?). @adriangb how is this computed? It does seem to reduce peak memory by a bit for TPC-DS though.
> >
> > I agree having it available as an option does make sense, consumers could opt in to the more conservative way.
> >
> > My hope / feeling with morsel-based scanning this actually can be removed (or we can convert the code to use some pipeline scheduler instead 🤔 and limit the pipeline parallelism (or limit it based on resource/memory usage) ), as driving a single pipeline at a time will be better in terms of cache-friendliness.

Indeed I took a look at a few benchmark runs and it's mostly the case that the main branch has a larger spill to disk (only in one case it was comparable), so it might very well be an artifact.
