alamb commented on PR #21828: URL: https://github.com/apache/datafusion/pull/21828#issuecomment-4347003410
> I understand where the need comes from, but there is a good reason why databases treat scans without ORDER BY as unordered: a lot of logical/physical planning optimizations depend on this assumption, and they can only rely on metadata to tell if the plan changes they want to make are safe or not.

I agree with @asolimando on this -- and I think DataFusion should not be breaking new ground on what semantics we implement (we should follow other DB implementations as much as possible).

> If the underlying data is truly sorted over something that can be encoded similarly to what you can write with an ORDER BY (or at least producing the same metadata DataFusion uses), that could be fine, but if the order is just the order rows happen to have in the files, and we can't encode this promise anywhere, then it gets complex.

We did recently add the ability to emit row id from the parquet reader 🤔 -- maybe we could make that work and then treat row group skipping as an optimization when the data is explicitly `ORDER BY row_number()` 🤔

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
