zachdisc opened a new pull request, #9731: URL: https://github.com/apache/iceberg/pull/9731
## What

This adds a simple `sort` method to the `RewriteManifests` Spark action, which lets the user specify the partition column order to consider when grouping manifests.

Illustration:

```
RewriteManifests.Result result = actions
    .rewriteManifests(table)
    .sort("c", "b", "a")  // <-- this is the new API piece
    .execute();
```

See issue https://github.com/apache/iceberg/issues/9615

## Why

Iceberg's metadata is organized into a forest of `manifest_file`s, each pointing to data files that share common partitions. By default, and during `RewriteManifests`, the partition grouping is determined by the default `Spec` partition order. If the primary query pattern is more aligned with the last partition field in the table's spec, the manifests are poorly suited to quick planning and pruning around those partitions. For example,

```
CREATE TABLE ... PARTITIONED BY (region, storeId, bucket(100, ipAddress), days(event_time))
```

will create manifests that first group by `region`, so each `manifest_file`'s contents may span a wide range of `event_time` values. For a primary query pattern that doesn't care about `region`, `storeId`, etc., this leads to inefficient queries.

## Requested Feedback and decisions

* I chose to make the input to `sort` be the _raw column names_ used in partitioning, not the internal hidden ones. That is, `event_time` instead of `event_time_day`, and `foo` instead of `foo_bucket_1234`. Thoughts? I could readily allow both, or just the real, hidden partition column names, if people prefer.
* I would next like to have a more capable functional interface, `sort(row -> {... return groupingString})`, but was struggling to express a Java `Function`-like input that could be used on a `Dataset<Row>`'s `data_file` struct. Any pointers on how to parse a `Row`'s struct into a native POJO and supply a function that can be used like a UDF here would be appreciated!
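To make the motivation in the Why section more concrete, here is a toy model of how the grouping order affects pruning. It is plain Java with no Iceberg dependency, and it is not how the action is implemented: `FilePartition`, `groupIntoManifests`, `manifestsTouched`, the fixed files-per-manifest cut, and the sample data are all hypothetical names and assumptions made up for illustration. It sorts partition tuples by a caller-supplied column order, cuts the sorted list into "manifests", and counts how many of them a filter on a single `event_time` day would have to open.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ManifestGroupingSketch {

    // Hypothetical stand-in for a data file's partition tuple; not an Iceberg type.
    static final class FilePartition {
        final String region;
        final int storeId;
        final int eventDay;

        FilePartition(String region, int storeId, int eventDay) {
            this.region = region;
            this.storeId = storeId;
            this.eventDay = eventDay;
        }
    }

    // Maps a raw column name to a comparator over the toy partition tuple.
    static Comparator<FilePartition> comparatorFor(String column) {
        switch (column) {
            case "region":
                return Comparator.comparing((FilePartition f) -> f.region);
            case "storeId":
                return Comparator.comparingInt((FilePartition f) -> f.storeId);
            case "event_time":
                return Comparator.comparingInt((FilePartition f) -> f.eventDay);
            default:
                throw new IllegalArgumentException("Unknown column: " + column);
        }
    }

    // Sorts files by the given column order, then cuts the sorted list into
    // fixed-size chunks; each chunk stands in for one manifest.
    static List<List<FilePartition>> groupIntoManifests(
            List<FilePartition> files, int filesPerManifest, String... columns) {
        Comparator<FilePartition> order = comparatorFor(columns[0]);
        for (int i = 1; i < columns.length; i++) {
            order = order.thenComparing(comparatorFor(columns[i]));
        }
        List<FilePartition> sorted = new ArrayList<>(files);
        sorted.sort(order);
        List<List<FilePartition>> manifests = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += filesPerManifest) {
            int end = Math.min(i + filesPerManifest, sorted.size());
            manifests.add(new ArrayList<>(sorted.subList(i, end)));
        }
        return manifests;
    }

    // Counts the manifests that contain at least one file for the given event day.
    static long manifestsTouched(List<List<FilePartition>> manifests, int eventDay) {
        return manifests.stream()
            .filter(m -> m.stream().anyMatch(f -> f.eventDay == eventDay))
            .count();
    }

    // Made-up data: 4 regions x 4 stores x 4 days = 64 files.
    static List<FilePartition> sampleFiles() {
        List<FilePartition> files = new ArrayList<>();
        for (String region : List.of("eu", "us", "ap", "sa")) {
            for (int store = 0; store < 4; store++) {
                for (int day = 0; day < 4; day++) {
                    files.add(new FilePartition(region, store, day));
                }
            }
        }
        return files;
    }

    public static void main(String[] args) {
        List<FilePartition> files = sampleFiles();
        long specOrder = manifestsTouched(
            groupIntoManifests(files, 16, "region", "storeId", "event_time"), 2);
        long timeFirst = manifestsTouched(
            groupIntoManifests(files, 16, "event_time", "region", "storeId"), 2);
        // prints "spec order: 4, event_time first: 1"
        System.out.println("spec order: " + specOrder + ", event_time first: " + timeFirst);
    }
}
```

In this toy data set, grouping in spec order leaves every manifest holding all four days, so a single-day filter touches all four manifests; leading with `event_time` confines each day to one manifest. Real tables will not divide this cleanly, but the direction of the effect is the point.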