zhuqi-lucas commented on PR #21182:
URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4178164670

   Benchmark results for `sort_pushdown_sorted` (3 files, reversed naming, 
release build on GKE):
   
   | Query | Description | Main (ms) | PR (ms) | Change |
   |-------|-------------|-----------|---------|--------|
   | Q1 | ORDER BY ASC (full scan) | 161 | 182 | 1.13x slower |
   | Q2 | ORDER BY ASC LIMIT 100 | 13 | 2.8 | **4.73x faster** |
   | Q3 | SELECT * ORDER BY ASC (full scan) | 376 | 525 | 1.40x slower |
   | Q4 | SELECT * ORDER BY ASC LIMIT 100 | 55 | 6.6 | **8.36x faster** |
   
   **LIMIT queries**: roughly 5-8x faster (4.73x for Q2, 8.36x for Q4). With 
the sort eliminated and the limit pushed down, the scan stops after reading 
only the first ~100 rows.
   
   **Full scans**: 1.1-1.4x slower — after sort elimination, `output_ordering` 
is preserved, which prevents the repartitioning optimizer from splitting files 
into multiple byte-range partitions. The result is single-partition sequential 
I/O vs. main's multi-partition parallel I/O + sort. I'm going to work on fixing 
this in the current PR — the idea is to allow repartitioning to split 
non-overlapping files across partitions while preserving per-partition ordering.
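
To make the idea concrete, here is a minimal sketch of the grouping step. It is not DataFusion's actual code: `FileRange` and `split_non_overlapping` are hypothetical names, and the sort key is assumed to be a single `i64` column whose per-file min/max statistics are known. Files are sorted by their minimum key, the split is rejected if any two ranges overlap, and the rest are cut into contiguous groups so that each partition is still ordered on its own.

```rust
/// Hypothetical per-file statistics for the sort key (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: String,
    min: i64,
    max: i64,
}

/// Splits files into at most `target` contiguous groups, each ordered by key.
/// Returns `None` if `target` is zero or if any two files overlap on the key,
/// since overlapping files cannot be concatenated without a real sort.
fn split_non_overlapping(
    mut files: Vec<FileRange>,
    target: usize,
) -> Option<Vec<Vec<FileRange>>> {
    if target == 0 {
        return None;
    }
    // Order files by their smallest key so neighbours can be compared.
    files.sort_by_key(|f| f.min);
    // Touching boundaries (max == next min) still give a non-decreasing stream.
    if files.windows(2).any(|w| w[0].max > w[1].min) {
        return None;
    }
    if files.is_empty() {
        return Some(Vec::new());
    }
    // Ceiling division: number of files per group.
    let per_group = (files.len() + target - 1) / target;
    Some(files.chunks(per_group).map(|c| c.to_vec()).collect())
}
```

With three non-overlapping files and `target = 2`, this yields two groups (two files, then one). Each group can be scanned in parallel, and because the groups are ordered both internally and relative to one another, an order-preserving merge on top (for example `SortPreservingMergeExec`) could presumably recombine them without a full sort.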
   
   For most production workloads, ORDER BY queries come with LIMIT, so the 5-8x 
improvement is the dominant benefit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

