zhuqi-lucas commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4178164670
Benchmark results for `sort_pushdown_sorted` (3 files, reversed naming, release build on GKE):

| Query | Description | Main (ms) | PR (ms) | Change |
|-------|-------------|-----------|---------|--------|
| Q1 | ORDER BY ASC (full scan) | 161 | 182 | 1.13x slower |
| Q2 | ORDER BY ASC LIMIT 100 | 13 | 2.8 | **4.73x faster** |
| Q3 | SELECT * ORDER BY ASC (full scan) | 376 | 525 | 1.40x slower |
| Q4 | SELECT * ORDER BY ASC LIMIT 100 | 55 | 6.6 | **8.36x faster** |

**LIMIT queries**: 5-8x faster. Sort elimination plus limit pushdown means only the first ~100 rows are read before stopping.

**Full scans**: 1.1-1.4x slower. After sort elimination, `output_ordering` is preserved, which prevents the repartitioning optimizer from splitting files into multiple byte-range partitions. The result is single-partition sequential I/O vs. main's multi-partition parallel I/O + sort.

I'm going to work on fixing this in the current PR. The idea is to allow repartitioning to split non-overlapping files across partitions while preserving per-partition ordering.

For most production workloads, ORDER BY queries come with LIMIT, so the 5-8x improvement is the dominant benefit.
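To make the proposed fix concrete, here is a minimal sketch of splitting non-overlapping files across partitions while keeping each partition ordered. It is not DataFusion code: `FileRange`, `non_overlapping`, and `split_preserving_order` are hypothetical names, and it assumes each file carries min/max statistics for a single integer sort key. Files are ordered by their key range rather than by name, which matters for the "reversed naming" setup in the benchmark.

```rust
/// Hypothetical per-file min/max statistics for the sort key (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: String,
    min: i64,
    max: i64,
}

/// True when files, already sorted by `min`, have strictly disjoint key ranges.
fn non_overlapping(sorted: &[FileRange]) -> bool {
    sorted.windows(2).all(|w| w[0].max < w[1].min)
}

/// Assigns files to at most `target` partitions as contiguous runs in key order.
/// Returns `None` when ranges overlap, since ordering could not be preserved
/// by simply concatenating files within a partition.
fn split_preserving_order(
    mut files: Vec<FileRange>,
    target: usize,
) -> Option<Vec<Vec<FileRange>>> {
    // Order by key range, not by file name.
    files.sort_by_key(|f| f.min);
    if !non_overlapping(&files) {
        return None;
    }
    if files.is_empty() {
        return Some(Vec::new());
    }
    // Never create more partitions than files; treat a target of 0 as 1.
    let k = target.max(1).min(files.len());
    // Ceiling division so every file lands in some partition.
    let chunk = (files.len() + k - 1) / k;
    Some(files.chunks(chunk).map(|c| c.to_vec()).collect())
}
```

Each partition produced this way is sorted on its own, and every key in an earlier partition is smaller than every key in a later one. Producing a single globally ordered stream from parallel partitions would still need an order-preserving merge on top (in DataFusion, presumably something like `SortPreservingMergeExec`).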
