[PR] Spark 4.1: Implement SupportsReportOrdering in scan to skip redundant sorts [iceberg]

via GitHub Wed, 20 May 2026 12:27:46 -0700


Shekharrajak opened a new pull request, #16454:
URL: https://github.com/apache/iceberg/pull/16454


   Ref #16430
   
   Iceberg stores `sort_order_id` per data file in the manifest, but the Spark 
scan never advertises this to the query planner. As a result, Spark inserts a 
redundant `Sort` node above an Iceberg scan whenever a query requires an 
ordering, even when every file in the snapshot was written sorted.
   
   The **write side** has long produced sorted files via 
`RequiresDistributionAndOrdering` (#2165, #3720, #7637) and tags each file with 
its `sort_order_id` (#15150, #15832, #16308). The **read side** never closes 
the loop, which is what this PR fixes.
   
   `SparkPartitioningAwareScan` now implements `SupportsReportOrdering` (Spark 
3.3+ API). `outputOrdering()` returns the table's current `SortOrder` 
(converted via `SortOrderToSpark`).
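
The idea can be modelled in plain Java without the Spark or Iceberg dependencies. This is a rough sketch under stated assumptions, not the code in this PR: `SortField`, `ReportsOrdering`, `SketchScan`, and `sortIsRedundant` are hypothetical stand-ins. The prefix check is only a simplified picture of how a planner could decide that a required ordering is already satisfied.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-ins only: none of these types are the Spark or Iceberg classes.
final class ReportOrderingSketch {

  /** Simplified sort field: a column name plus a direction (null ordering omitted). */
  static final class SortField {
    final String column;
    final boolean ascending;

    SortField(String column, boolean ascending) {
      this.column = column;
      this.ascending = ascending;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof SortField)) {
        return false;
      }
      SortField other = (SortField) o;
      return column.equals(other.column) && ascending == other.ascending;
    }

    @Override
    public int hashCode() {
      return Objects.hash(column, ascending);
    }
  }

  /** Stand-in for the role SupportsReportOrdering plays: the scan reports its ordering. */
  interface ReportsOrdering {
    List<SortField> outputOrdering();
  }

  /** Stand-in scan that reports the table's current sort order, as the PR describes. */
  static final class SketchScan implements ReportsOrdering {
    private final List<SortField> tableSortOrder;

    SketchScan(List<SortField> tableSortOrder) {
      this.tableSortOrder = tableSortOrder;
    }

    @Override
    public List<SortField> outputOrdering() {
      return tableSortOrder;
    }
  }

  /** Assumed rule: a sort is redundant when the required ordering is a prefix of the reported one. */
  static boolean sortIsRedundant(List<SortField> required, List<SortField> reported) {
    if (required.size() > reported.size()) {
      return false;
    }
    return reported.subList(0, required.size()).equals(required);
  }

  public static void main(String[] args) {
    ReportsOrdering scan = new SketchScan(Arrays.asList(new SortField("event_time", true)));
    List<SortField> required = Arrays.asList(new SortField("event_time", true));
    System.out.println(sortIsRedundant(required, scan.outputOrdering()));
  }
}
```

In Spark itself, `SupportsReportOrdering.outputOrdering()` returns a `SortOrder[]` describing the order of rows within each partition of the scan. The sketch ignores partitioning and null ordering to keep the comparison readable.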
   
   Example:
   
   ```
   CREATE TABLE events (user_id BIGINT, event_time TIMESTAMP) USING iceberg;
   ALTER TABLE events WRITE ORDERED BY event_time;
   INSERT INTO events SELECT * FROM source;
   
   EXPLAIN SELECT user_id, event_time,
                  ROW_NUMBER() OVER (ORDER BY event_time) AS rn
   FROM events;
   ```
   
   Before: 
   
   ```
   Window [row_number() OVER (ORDER BY event_time ASC)]
   +- Sort [event_time ASC NULLS FIRST], false, 0          ← redundant
      +- Exchange SinglePartition
            +- BatchScan events                               ← sort_order_id=1, ignored
   ```
   
   After this change: 
   
   ```
   Window [row_number() OVER (ORDER BY event_time ASC)]
   +- Exchange SinglePartition
         +- BatchScan events                                  ← outputOrdering=[event_time ASC]
                                                            (Sort eliminated)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

