anuragmantri opened a new pull request, #16750:
URL: https://github.com/apache/iceberg/pull/16750
This PR depends on #14948
This PR implements the Spark DSv2 SupportsReportOrdering API so that Iceberg
can report a table's sort order to Spark. For partitioned tables that have a
defined sort order and whose files were written respecting that order, this
lets Spark eliminate redundant sorts when reading.
Sort order reporting can be enabled with:
```sql
SET spark.sql.iceberg.planning.preserve-data-ordering = true; -- default: false
```
Implementation summary:
1. SortOrderAnalyzer validates two conditions before
SparkPartitioningAwareScan.outputOrdering() reports ordering to Spark:
- all files carry the current sort order ID
- each grouping key maps to exactly one task group (bin-packing must
not split partitions)
2. Merging sorted files: when ordering is reported, rows from multiple sorted
files within a partition are merged with a k-way merge by
MergingSortedRowDataReader, which is added in a separate PR (#14948). The
plumbing for the merging reader (SparkRowReaderFactory, SparkBatch) is
included here.
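
To make the two conditions in item 1 concrete, here is a minimal standalone
sketch. It is not the SortOrderAnalyzer code from this PR: `FileInfo`,
`canReportOrdering`, and the string grouping keys are hypothetical names used
only for illustration, and each inner list stands for one task group.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OrderingCheckSketch {
  // Hypothetical stand-in for a data file: the grouping (partition) key it
  // belongs to and the ID of the sort order it was written with.
  static final class FileInfo {
    final String groupingKey;
    final int sortOrderId;

    FileInfo(String groupingKey, int sortOrderId) {
      this.groupingKey = groupingKey;
      this.sortOrderId = sortOrderId;
    }
  }

  // Returns true only if both conditions hold:
  //  - every file carries the current sort order ID
  //  - every grouping key appears in exactly one task group
  static boolean canReportOrdering(List<List<FileInfo>> taskGroups, int currentSortOrderId) {
    Map<String, Integer> groupOfKey = new HashMap<>();
    for (int i = 0; i < taskGroups.size(); i++) {
      for (FileInfo file : taskGroups.get(i)) {
        if (file.sortOrderId != currentSortOrderId) {
          return false; // file written under a different (or no) sort order
        }
        Integer previous = groupOfKey.putIfAbsent(file.groupingKey, i);
        if (previous != null && previous != i) {
          return false; // the same grouping key was split across task groups
        }
      }
    }
    return true;
  }
}
```

If either check fails, the sketch returns false, which corresponds to not
reporting any ordering to Spark.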
Constraints:
1. When `preserve-data-ordering` is enabled, bin-packing of large partitions
is disabled: all files within a partition are placed into a single Spark
task. This is a known limitation of the current KeyGroupedPartitioning
approach and is expected to be addressed in
[SPARK-56241](https://issues.apache.org/jira/browse/SPARK-56241).
2. Vectorized reads are disabled for partitions with more than one file
since k-way merge is row-based.
3. This implementation reports sort order only if all files were written
with the table's current sort order.
Depends on #14948 for MergingSortedRowDataReader.
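
For readers unfamiliar with k-way merge, the idea can be sketched with a
min-heap holding the current head row of each sorted input. This is a generic
illustration, not the MergingSortedRowDataReader from #14948: the class and
method names are hypothetical, and it collects the result into a list, whereas
a reader would more likely stream rows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMergeSketch {
  // One heap entry: the current head row of a source plus the rest of it.
  private static final class Head<T> {
    final T row;
    final Iterator<T> rest;

    Head(T row, Iterator<T> rest) {
      this.row = row;
      this.rest = rest;
    }
  }

  // Merges already-sorted sources into one sorted list. Each source must be
  // sorted by the same comparator that is passed in as `order`.
  static <T> List<T> merge(List<? extends Iterable<T>> sortedSources, Comparator<? super T> order) {
    Comparator<Head<T>> byRow = (a, b) -> order.compare(a.row, b.row);
    PriorityQueue<Head<T>> heap = new PriorityQueue<>(byRow);

    // Seed the heap with the first row of every non-empty source.
    for (Iterable<T> source : sortedSources) {
      Iterator<T> it = source.iterator();
      if (it.hasNext()) {
        heap.add(new Head<>(it.next(), it));
      }
    }

    List<T> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      // Emit the smallest head, then refill the heap from the same source.
      Head<T> smallest = heap.poll();
      merged.add(smallest.row);
      if (smallest.rest.hasNext()) {
        heap.add(new Head<>(smallest.rest.next(), smallest.rest));
      }
    }
    return merged;
  }
}
```

The merge advances one row at a time across all inputs, which is consistent
with constraint 2 above: it does not fit a columnar batch reader directly.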
AI Usage: I used Claude Opus 4.6 for code generation and writing tests. I
manually reviewed the generated code.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]