Re: [PR] Spark: Adding simple custom partition sort order option to RewriteManifests Spark Action [iceberg]

via GitHub Tue, 15 Oct 2024 20:38:08 -0700


ZachDischner commented on code in PR #9731:
URL: https://github.com/apache/iceberg/pull/9731#discussion_r1802296391



##########
api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java:
##########
@@ -44,6 +47,43 @@ public interface RewriteManifests
    */
   RewriteManifests rewriteIf(Predicate<ManifestFile> predicate);
 
+  /**
+   * Rewrite manifests in a given order, based on partition field names
+   *
+   * <p>Supply an optional set of partition field names to cluster the rewritten manifests by. For
+   * example, given a table PARTITIONED BY (a, b, c, d), you may wish to rewrite and cluster
+   * manifests by ('d', 'b') only, based on your query patterns. Rewriting Manifests in this way
+   * will yield manifest_lists that point to manifest_files containing data files for common 'd' and
+   * 'b' partitions.
+   *
+   * <p>If not set, manifests will be rewritten in the order of the transforms in the table's
+   * current partition spec.
+   *
+   * @param partitionFieldClustering Exact transformed column names used for partitioning; not the
+   *     raw column names that partitions are derived from. E.G. supply 'data_bucket' and not 'data'
+   *     for a bucket(N, data) partition * definition
+   * @return this for method chaining
+   */
+  default RewriteManifests clusterBy(List<String> partitionFieldClustering) {
+    throw new UnsupportedOperationException(
+        this.getClass().getName() + " doesn't implement clusterBy(List<String>)");
+  }
+
+  /**
+   * Rewrite manifests in a given order, dictated by a custom Function
+   *
+   * <p>Supply a Function which will apply its own custom clustering logic based on supplied {@link
+   * org.apache.iceberg.DataFile} attributes.
+   *
+   * @param clusterStrategyFunction A Function that returns a String to be used for manifest
+   *     clustering
+   * @return this method for chaining
+   */
+  default RewriteManifests clusterBy(Function<DataFile, String> clusterStrategyFunction) {

Review Comment:
   There are reasons to, but it is up to you to decide whether they are good enough reasons.
   
   For example, I know that a primary use case for tables partitioned by `bucket(10000, someId)` will require reading the `someId_bucket=1` and `someId_bucket=999` data together, so I'll want to cluster my manifests around that access pattern.
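
A minimal sketch of what such a function could look like, under stated assumptions. `FileInfo` is a hypothetical stand-in for the part of `org.apache.iceberg.DataFile` the function would inspect (here only the bucket number), and the co-read grouping is made-up workload knowledge. A real `Function<DataFile, String>` would have to read the bucket value from the file's partition data instead.

```java
import java.util.Map;
import java.util.function.Function;

final class BucketClusteringSketch {
  // Hypothetical stand-in for a DataFile: only carries the bucket number.
  static final class FileInfo {
    final int bucket;

    FileInfo(int bucket) {
      this.bucket = bucket;
    }
  }

  // Assumed workload knowledge: buckets 1 and 999 are usually read together.
  static final Map<Integer, String> CO_READ_GROUPS = Map.of(1, "group-a", 999, "group-a");

  // Files whose buckets share a group get the same cluster key;
  // every other bucket keeps a key of its own.
  static final Function<FileInfo, String> CLUSTER_BY_CO_READ =
      file -> CO_READ_GROUPS.getOrDefault(file.bucket, "bucket-" + file.bucket);
}
```

Files that return the same key would be candidates for landing in the same rewritten manifest, which is the point of passing a custom function rather than a list of partition field names.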
   
   A potentially simpler case: say I'm partitioning by `month(timestamp)`. I can take the min/max timestamp values of each data file and cluster the files within a given month partition by day, for more efficient query planning. This option lets power users build on top of what Iceberg provides by default.
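
As a rough illustration of the month/day idea, here is a sketch that derives a per-day cluster key from a file's minimum timestamp. `FileInfo` and its `minTimestampMillis` field are hypothetical stand-ins; with a real `DataFile` the value would presumably have to be decoded from the file's column lower bounds, which is not shown. The use of epoch milliseconds and UTC is also an assumption made for the example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.function.Function;

final class DayClusteringSketch {
  // Hypothetical stand-in for a DataFile: only carries the lower bound of the
  // timestamp column, expressed as epoch milliseconds (an assumption).
  static final class FileInfo {
    final long minTimestampMillis;

    FileInfo(long minTimestampMillis) {
      this.minTimestampMillis = minTimestampMillis;
    }
  }

  // Cluster key is the UTC calendar day (yyyy-MM-dd) of the file's minimum timestamp,
  // so files in one month partition are further grouped by day.
  static final Function<FileInfo, String> CLUSTER_BY_DAY =
      file ->
          Instant.ofEpochMilli(file.minTimestampMillis)
              .atZone(ZoneOffset.UTC)
              .toLocalDate()
              .toString();
}
```

A file spanning several days would be keyed by its first day here; keying on the max value or on a (min, max) pair would be equally plausible choices.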



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
