zachdisc opened a new pull request, #12840: URL: https://github.com/apache/iceberg/pull/12840
**Note:** this is a fresh PR replacing https://github.com/apache/iceberg/pull/9731. That PR had accumulated too many conflicts and changes, and I messed it up while rebasing. This is a clean start with all previous feedback incorporated.

## What

This adds a simple `sort` method to the `RewriteManifests` Spark action, which lets users specify the partition column order to consider when grouping manifests.

Illustration:

```
RewriteManifests.Result result =
    actions
        .rewriteManifests(table)
        .sort("c", "b", "a") // <-- this is the new API piece
        .execute();
```

Closes https://github.com/apache/iceberg/issues/9615

## Why

Iceberg's metadata is organized into a forest of manifest_files, which point to data files sharing common partitions. By default, and during `RewriteManifests`, the partition grouping is determined by the default `Spec` partition order. If the primary query pattern is more aligned with the last partition field in the table's spec, the manifests are poorly suited to quickly planning and pruning around those partitions.

For example:

```
CREATE TABLE ... PARTITIONED BY (region, storeId, bucket(100, ipAddress), days(event_time))
```

This will create manifests that first group by `region`, so the contents of each `manifest_file` may span a wide range of `event_time` values. For a primary query pattern that filters on `event_time` and doesn't care about `region`, `storeId`, etc., this leads to inefficient queries, because most manifests have to be opened during planning.
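To make the motivation concrete, here is a toy model of the effect described above. It is only a sketch and does not call any Iceberg API: the class `ManifestGroupingSketch`, its methods, the `event_day` column, and the fixed "files per manifest" limit are all hypothetical stand-ins (the real action clusters manifest entries in Spark and sizes manifests by bytes). It sorts partition tuples by a chosen column order, cuts the sorted list into fixed-size groups, and counts how many groups a filter on one column has to open.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ManifestGroupingSketch {

  // Hypothetical stand-in for data files: each map is one file's partition tuple.
  static List<Map<String, String>> sampleFiles() {
    List<Map<String, String>> files = new ArrayList<>();
    for (String region : List.of("eu", "us")) {
      for (String store : List.of("s1", "s2")) {
        for (String day : List.of("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04")) {
          files.add(Map.of("region", region, "storeId", store, "event_day", day));
        }
      }
    }
    return files;
  }

  // Sort files by the given column order, then cut the sorted list into
  // groups of at most filesPerManifest entries (one group = one toy "manifest").
  static List<List<Map<String, String>>> groupIntoManifests(
      List<Map<String, String>> files, List<String> sortColumns, int filesPerManifest) {
    Comparator<Map<String, String>> order = (a, b) -> 0;
    for (String column : sortColumns) {
      order = order.thenComparing(file -> file.get(column));
    }
    List<Map<String, String>> sorted = new ArrayList<>(files);
    sorted.sort(order);

    List<List<Map<String, String>>> manifests = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i += filesPerManifest) {
      int end = Math.min(i + filesPerManifest, sorted.size());
      manifests.add(new ArrayList<>(sorted.subList(i, end)));
    }
    return manifests;
  }

  // Count the toy manifests that contain at least one file matching column = value,
  // i.e. the ones a planner could not prune for that filter.
  static long manifestsToScan(
      List<List<Map<String, String>>> manifests, String column, String value) {
    return manifests.stream()
        .filter(manifest -> manifest.stream().anyMatch(f -> value.equals(f.get(column))))
        .count();
  }

  public static void main(String[] args) {
    List<Map<String, String>> files = sampleFiles();

    // Grouping in spec order: every group spans all four days.
    long specOrder = manifestsToScan(
        groupIntoManifests(files, List.of("region", "storeId", "event_day"), 4),
        "event_day", "2024-01-01");

    // Grouping with the time column first: each group holds a single day.
    long timeFirst = manifestsToScan(
        groupIntoManifests(files, List.of("event_day", "region", "storeId"), 4),
        "event_day", "2024-01-01");

    System.out.println("spec order: " + specOrder + " manifests to scan"); // 4
    System.out.println("time first: " + timeFirst + " manifests to scan"); // 1
  }
}
```

In this toy setup, a filter on a single day touches every group when files are grouped in spec order, but only one group when the time column leads the sort. That is the kind of pruning improvement the proposed `sort` option is aiming for.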