zachdisc opened a new pull request, #12840:
URL: https://github.com/apache/iceberg/pull/12840

   **Note:** this is a fresh PR replacing 
https://github.com/apache/iceberg/pull/9731. That PR had accumulated too many 
conflicts and changes, and my rebase broke it. This is a clean start with 
all previous feedback incorporated. 
   
   ## What
   This adds a simple `sort` method to the `RewriteManifests` Spark action, 
which lets users specify the order of partition columns to consider when 
grouping manifests. 
   
   Illustration: 
   
   ```
   RewriteManifests.Result result =
           actions
               .rewriteManifests(table)
               .sort("c", "b", "a")  // <-- this is the new API piece
               .execute();
   ```
   
   Closes https://github.com/apache/iceberg/issues/9615
   
   
   ## Why
   Iceberg's metadata is organized into a forest of `manifest_file`s, each of 
which points to data files sharing common partitions. By default, and during 
`RewriteManifests`, the partition grouping is determined by the field order of 
the table's default partition `Spec`. If the primary query pattern is more 
aligned with the last partition field in the table's spec, the manifests are 
poorly suited to quick planning and pruning on that field. 
   
   For example: 
   ```
   CREATE TABLE
   ...
   PARTITIONED BY (region, storeId, bucket(ipAddress, 100), days(event_time))
   ```
   This creates manifests that are grouped first by `region`, so the contents 
of each `manifest_file` may span a wide range of `event_time` values. For a 
primary query pattern that filters on `event_time` and doesn't care about 
`region`, `storeId`, etc., planning has to open many manifests, which leads to 
inefficient queries. 
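   
   To make the effect concrete, here is a minimal, self-contained sketch in 
plain Java. It does not use any Iceberg classes: `PartitionKey`, `group`, and 
`manifestsToScan` are hypothetical names invented for this illustration, and 
cutting the sorted list into fixed-size chunks is only an assumed stand-in for 
how a rewrite packs entries into manifests. 
   
   ```java
   import java.util.ArrayList;
   import java.util.Comparator;
   import java.util.List;

   public class ManifestSortSketch {
       // Hypothetical stand-in for one data file's partition tuple (not an Iceberg type).
       record PartitionKey(String region, int day) {}

       // 4 regions x 4 days = 16 partition tuples, one per simulated data file.
       static List<PartitionKey> sampleKeys() {
           List<PartitionKey> keys = new ArrayList<>();
           for (String region : List.of("us", "eu", "ap", "sa")) {
               for (int day = 1; day <= 4; day++) {
                   keys.add(new PartitionKey(region, day));
               }
           }
           return keys;
       }

       // Sort the tuples, then cut them into fixed-size chunks.
       // Each chunk stands in for one rewritten manifest (an assumption of this sketch).
       static List<List<PartitionKey>> group(
               List<PartitionKey> keys, Comparator<PartitionKey> order, int perManifest) {
           List<PartitionKey> sorted = new ArrayList<>(keys);
           sorted.sort(order);
           List<List<PartitionKey>> manifests = new ArrayList<>();
           for (int i = 0; i < sorted.size(); i += perManifest) {
               manifests.add(sorted.subList(i, Math.min(i + perManifest, sorted.size())));
           }
           return manifests;
       }

       // Count the manifests that contain at least one entry for the requested day,
       // i.e. the manifests a planner could not prune for a filter on that day.
       static long manifestsToScan(List<List<PartitionKey>> manifests, int day) {
           return manifests.stream()
                   .filter(m -> m.stream().anyMatch(k -> k.day() == day))
                   .count();
       }

       public static void main(String[] args) {
           List<PartitionKey> keys = sampleKeys();

           // Spec order: region first, then day.
           Comparator<PartitionKey> regionFirst =
                   Comparator.comparing(PartitionKey::region).thenComparingInt(PartitionKey::day);
           // Query-aligned order: day first, then region.
           Comparator<PartitionKey> dayFirst =
                   Comparator.comparingInt(PartitionKey::day).thenComparing(PartitionKey::region);

           System.out.println(manifestsToScan(group(keys, regionFirst, 4), 2)); // prints 4
           System.out.println(manifestsToScan(group(keys, dayFirst, 4), 2));    // prints 1
       }
   }
   ```
   
   With the region-first order, every simulated manifest holds all four days, 
so a filter on a single day touches all of them. With the day-first order, 
each day lands in its own manifest and the same filter touches only one. 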
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

