alamb commented on PR #21351:
URL: https://github.com/apache/datafusion/pull/21351#issuecomment-4237610891

> Really cool! I'll try to allocate some time to this / the base PR.
>
> Let's also collect some follow-up work if we haven't yet! I think the latest PRs allow us to do things a bit differently and get the most out of it!
>
> Here are some off the top of my head (I think roughly in order of importance):
>
> 1. Morsel splitting (more parallelism at the tail / small queries) / merging (small batch decoding/processing overhead)
> 2. Prefetching IO / combining small IO requests (reducing `spawn_blocking` / thread switching overhead)
> 3. Implement morsel-based scan for other datasources
> 4. Avoid eagerly executing sub-plans (now that we can extract more parallelism). Depends at least on 1.
> 5. Move batch coalescing in RepartitionExec _before_ rather than after sending (reducing channel traffic / improving cache-friendliness)
>
> I also have a feeling the change in execution might move bottlenecks to other parts (e.g. memory bandwidth, aggregation state), so some optimizations might be worth it now that weren't before, because it is easier to hit some limit...

Thank you. Yes, I am quite pleased with where this code is heading (i.e. it gives us a foundation for doing many other things).

I think the most interesting one, as you suggest, is the morsel splitting, which would help especially for parquet files where the chunks are not ideally sized (e.g. one giant row group).

I played with some of that list on my prototype PR and I found it pretty tricky to get right.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
