[PR] Spark: Adding simple custom partition sort order option to RewriteManifests Spark Action [iceberg]

via GitHub Thu, 15 Feb 2024 10:47:50 -0800


zachdisc opened a new pull request, #9731:
URL: https://github.com/apache/iceberg/pull/9731


   ## What
   This adds a simple `sort` method to the `RewriteManifests` Spark action, 
which lets users specify the partition column order to consider when grouping 
manifests. 
   
   Illustration: 
   
   ```
   RewriteManifests.Result result =
           actions
               .rewriteManifests(table)
               .sort("c", "b", "a")  // <-- this is the new API piece
               .execute();
   ```
   
   See issue https://github.com/apache/iceberg/issues/9615
   
   
   ## Why
   Iceberg's metadata is organized into a forest of `manifest_file`s that point 
to data files sharing common partitions. By default, and during 
`RewriteManifests`, the partition grouping is determined by the field order of 
the table's default partition `Spec`. If the primary query pattern is more 
aligned with the last partition field in the table's spec, the manifests are 
poorly suited to quickly planning and pruning around those partitions. 
   
   For example: 
   ```
   CREATE TABLE
   ...
   PARTITIONED BY (region, storeId, bucket(100, ipAddress), days(event_time))
   ```
   This creates manifests that group by `region` first, so the contents of each 
`manifest_file` may span a wide range of `event_time` values. For a primary 
query pattern that filters on `event_time` and doesn't care about `region`, 
`storeId`, etc., few manifests can be pruned during planning, which leads to 
inefficient queries. 
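
To make the effect concrete, here is a small plain-Java simulation. It is only a sketch and does not use any Iceberg API: `PartitionTuple`, `manifestsToScan`, and the fixed files-per-manifest cut are all hypothetical stand-ins, and the bucket field is left out for brevity. It sorts the same set of data files in two orders, cuts them into manifests, and counts how many manifests have a day range that contains the queried day.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ManifestClusteringSketch {

    /** Hypothetical stand-in for the partition tuple of one data file. */
    static final class PartitionTuple {
        final String region;
        final int storeId;
        final int day; // day ordinal, standing in for days(event_time)

        PartitionTuple(String region, int storeId, int day) {
            this.region = region;
            this.storeId = storeId;
            this.day = day;
        }
    }

    // Grouping that follows the spec field order: region, storeId, day.
    static final Comparator<PartitionTuple> SPEC_ORDER =
        Comparator.comparing((PartitionTuple t) -> t.region)
            .thenComparingInt(t -> t.storeId)
            .thenComparingInt(t -> t.day);

    // Grouping that puts the day first, as sort("event_time", ...) would request.
    static final Comparator<PartitionTuple> DAY_FIRST =
        Comparator.comparingInt((PartitionTuple t) -> t.day)
            .thenComparing(t -> t.region)
            .thenComparingInt(t -> t.storeId);

    /** Sorts the files, cuts them into fixed-size manifests, and counts the
     *  manifests whose [minDay, maxDay] range contains queryDay. */
    static int manifestsToScan(List<PartitionTuple> files, Comparator<PartitionTuple> order,
                               int filesPerManifest, int queryDay) {
        List<PartitionTuple> sorted = new ArrayList<>(files);
        sorted.sort(order);
        int hits = 0;
        for (int start = 0; start < sorted.size(); start += filesPerManifest) {
            int end = Math.min(start + filesPerManifest, sorted.size());
            int minDay = Integer.MAX_VALUE;
            int maxDay = Integer.MIN_VALUE;
            for (PartitionTuple t : sorted.subList(start, end)) {
                minDay = Math.min(minDay, t.day);
                maxDay = Math.max(maxDay, t.day);
            }
            if (minDay <= queryDay && queryDay <= maxDay) {
                hits++;
            }
        }
        return hits;
    }

    /** 4 regions x 5 stores x 10 days = 200 files, one file per partition. */
    static List<PartitionTuple> sampleFiles() {
        List<PartitionTuple> files = new ArrayList<>();
        String[] regions = {"apac", "emea", "latam", "na"};
        for (String region : regions) {
            for (int store = 0; store < 5; store++) {
                for (int day = 0; day < 10; day++) {
                    files.add(new PartitionTuple(region, store, day));
                }
            }
        }
        return files;
    }

    public static void main(String[] args) {
        List<PartitionTuple> files = sampleFiles();
        System.out.println("spec order: " + manifestsToScan(files, SPEC_ORDER, 20, 3));
        System.out.println("day first:  " + manifestsToScan(files, DAY_FIRST, 20, 3));
    }
}
```

With 20 files per manifest, the spec-order grouping leaves every one of the 10 manifests covering all days, so a single-day query has to open all of them; the day-first grouping puts each day into its own manifest, so the same query opens one.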
   
   
   
   ## Requested Feedback and decisions
   * I chose to make the input to `sort` be the _raw column names_ used in 
partitioning, not the internal hidden ones, i.e. `event_time` instead of 
`event_time_day`, and `foo` instead of `foo_bucket_1234`. Thoughts? I could 
readily allow both, or accept only the real, hidden partition column names if 
people prefer. 
   * Next, I would like to add a more capable functional interface, `sort(row -> 
{... return groupingString})`, but I was struggling to express a Java 
`Function`-like input that could be used on a `Dataset<Row>`'s `data_file` 
struct. Any pointers on how to parse a `Row`'s struct into a native POJO and 
supply a function that can be used like a UDF here would be appreciated! 
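
For the first point, the raw-name-to-hidden-name lookup can be sketched in plain Java as below. This is an illustration only, not the code in this PR: `SortColumnResolverSketch`, `resolve`, and `exampleSpec` are hypothetical names, the map stands in for whatever the real partition spec provides, and the hidden field names are illustrative. It also assumes each source column backs at most one partition field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SortColumnResolverSketch {

    /**
     * Maps raw source column names, in the order the caller passed them,
     * to hidden partition field names.
     *
     * @param sourceToPartitionField hypothetical lookup from source column
     *     name to partition field name
     * @param rawColumns columns as the user would pass them to sort(...)
     */
    static List<String> resolve(Map<String, String> sourceToPartitionField, String... rawColumns) {
        List<String> resolved = new ArrayList<>();
        for (String raw : rawColumns) {
            String field = sourceToPartitionField.get(raw);
            if (field == null) {
                // Fail early rather than silently ignoring an unknown column.
                throw new IllegalArgumentException("Not a partition source column: " + raw);
            }
            resolved.add(field);
        }
        return resolved;
    }

    /** Illustrative lookup for the example table; field names are assumptions. */
    static Map<String, String> exampleSpec() {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("region", "region");              // identity transform
        spec.put("storeId", "storeId");            // identity transform
        spec.put("ipAddress", "ipAddress_bucket"); // bucket transform
        spec.put("event_time", "event_time_day");  // days transform
        return spec;
    }

    public static void main(String[] args) {
        System.out.println(resolve(exampleSpec(), "event_time", "region"));
    }
}
```

Accepting both raw and hidden names would amount to also checking the map's values before rejecting a column.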


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

