adriangb commented on PR #21182:
URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4178251973

   > **Full scans**: 1.1-1.4x slower — after sort elimination, 
`output_ordering` is preserved, which prevents the repartitioning optimizer 
from splitting files into multiple byte-range partitions. The result is 
single-partition sequential I/O vs. main's multi-partition parallel I/O + sort. 
I'm going to work on fixing this in the current PR — the idea is to allow 
repartitioning to split non-overlapping files across partitions while 
preserving per-partition ordering.
   
   Just re-ran a couple of times; the results are reproducible. I think your 
hypothesis is right. One option, if we need it: split this PR up and have 
part of it (the `Exact` cases that eliminate the sort) wait until we have 
morselization or something similar. Regardless, we should think about how 
this is going to interact with morselization.
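
   To make the quoted idea concrete, here is a rough sketch of splitting 
non-overlapping files across partitions while keeping each partition 
ordered. Everything in it is hypothetical: `FileRange` and 
`split_non_overlapping` are illustrative names, not DataFusion types or 
APIs, and the sketch assumes each file exposes min/max statistics on a 
single integer sort key and that ranges which merely touch count as 
non-overlapping.

```rust
/// Hypothetical per-file statistics on the sort key (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: &'static str,
    min: i64,
    max: i64,
}

/// Distribute files over at most `target_partitions` groups so that every
/// group lists its files in ascending, non-overlapping key order.
/// Returns `None` when any two files overlap, i.e. when reading files one
/// after another would not preserve the ordering.
fn split_non_overlapping(
    mut files: Vec<FileRange>,
    target_partitions: usize,
) -> Option<Vec<Vec<FileRange>>> {
    if target_partitions == 0 {
        return None;
    }
    // Order files by their key range.
    files.sort_by_key(|f| (f.min, f.max));
    // Assumption: a file may start exactly where the previous one ends.
    if files.windows(2).any(|w| w[1].min < w[0].max) {
        return None;
    }
    // Never create more partitions than files, but always at least one.
    let n = target_partitions.min(files.len()).max(1);
    let mut partitions = vec![Vec::new(); n];
    // Round-robin assignment: any subsequence of a sorted, non-overlapping
    // list is itself sorted and non-overlapping.
    for (i, file) in files.into_iter().enumerate() {
        partitions[i % n].push(file);
    }
    Some(partitions)
}
```

   With four disjoint files and two target partitions, this gives each 
partition two files in key order, so a sort-preserving merge on top could 
restore the global order without a full sort. Whether round-robin or 
contiguous chunks is the better assignment likely depends on how this ends 
up interacting with morselization.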


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

