adriangb commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4178251973
> **Full scans**: 1.1-1.4x slower — after sort elimination, `output_ordering` is preserved, which prevents the repartitioning optimizer from splitting files into multiple byte-range partitions. The result is single-partition sequential I/O vs. main's multi-partition parallel I/O + sort. I'm going to work on fixing this in the current PR — the idea is to allow repartitioning to split non-overlapping files across partitions while preserving per-partition ordering.

I just re-ran this a couple of times and the results are reproducible. I think your hypothesis is right.

One option I'll offer: if we need to split this PR up, part of it (the `Exact` cases that eliminate the sort) could wait until we have morselization or something similar. Regardless, we should think about how this is going to interact with morselization.
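To make the quoted idea concrete, here is a minimal sketch of distributing non-overlapping, sorted files across partitions so that each partition still reads its files in ascending key order. It is not DataFusion's API: `FileRange`, its fields, and `split_preserving_order` are hypothetical names for illustration, and the greedy size balancing is only one possible policy.

```rust
/// Hypothetical stand-in for a file with min/max statistics on the sort key.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: &'static str,
    min: i64,
    max: i64,
    bytes: u64,
}

/// Assigns files to `target_partitions` groups so that every group lists its
/// files in ascending, non-overlapping key order. Returns `None` when the
/// files overlap (ordering cannot be preserved by concatenation) or when
/// `target_partitions` is zero.
fn split_preserving_order(
    mut files: Vec<FileRange>,
    target_partitions: usize,
) -> Option<Vec<Vec<FileRange>>> {
    if target_partitions == 0 {
        return None;
    }

    // Order files by their minimum key, then reject overlapping neighbours.
    files.sort_by_key(|f| f.min);
    if files.windows(2).any(|w| w[0].max > w[1].min) {
        return None;
    }

    let mut partitions: Vec<Vec<FileRange>> = vec![Vec::new(); target_partitions];
    let mut sizes = vec![0u64; target_partitions];

    // Files are visited in ascending key order, so appending a file to any
    // partition keeps that partition's files sorted and non-overlapping.
    for file in files {
        // Pick the partition with the fewest bytes so far (first one on ties).
        let idx = sizes
            .iter()
            .enumerate()
            .min_by_key(|(_, size)| **size)
            .map(|(i, _)| i)?;
        sizes[idx] += file.bytes;
        partitions[idx].push(file);
    }

    Some(partitions)
}
```

Each partition's output is then ordered on its own, so a downstream operator that needs a global order would still have to merge the partitions, but no per-partition sort is required.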
