RussellSpitzer commented on code in PR #12840:
URL: https://github.com/apache/iceberg/pull/12840#discussion_r2054920719


##########
api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java:
##########
@@ -44,6 +45,28 @@ public interface RewriteManifests
    */
   RewriteManifests rewriteIf(Predicate<ManifestFile> predicate);
 
+  /**
+   * Rewrite manifests in a given order, based on partition field names
+   *
+   * <p>Supply an optional set of partition field names to cluster the rewritten manifests by. For
+   * example, given a table PARTITIONED BY (a, b, c, d), one may wish to rewrite and cluster
+   * manifests by ('d', 'b') only, based on known query patterns. Rewriting Manifests in this way
+   * will yield manifest_lists that point to manifest_files containing data files for common 'd' and
+   * 'b' partitions.
+   *
+   * <p>If not set, manifests will be rewritten in the order of the transforms in the table's
+   * current partition spec.
+   *
+   * @param partitionFields Exact transformed column names used for partitioning; not the raw column
+   *     names that partitions are derived from. E.G. supply 'data_bucket' and not 'data' for a
+   *     bucket(N, data) partition * definition
+   * @return this for method chaining
+   */
+  default RewriteManifests clusterBy(List<String> partitionFields) {

Review Comment:
   Hmm, that's a good question. In my eyes we are doing a hierarchical sort, which feels different to me from a multi-dimensional clustering algorithm. For example, Cluster(a, b) might get me manifests that group tuples in which a and b are correlated, but we can't actually do that here.
   
   So, for example, I would expect cluster to make files like:
   ```
   {(1,1)(1,2)(2,1)(2,2)} 
   {(1,3)(1,4)(2,3)(2,4)}
   {(3,1)(3,2)(4,1)(4,2)}
   {(3,3)(3,4)(4,3)(4,4)}
   ```
   I would consider that clustered.
   
   But our current algorithm can't do that. It can only do a hierarchical sort, where each column is ordered within the one before it. In the above example, if I cluster by (a, b), I would produce:
   
   ```
   {(1,1)(1,2)(1,3)(1,4)}
   {(2,1)(2,2)(2,3)(2,4)}
   {(3,1)(3,2)(3,3)(3,4)}
   {(4,1)(4,2)(4,3)(4,4)}
   ```
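
To make the difference concrete, here is a rough sketch that reproduces both layouts for a 4x4 grid of (a, b) partition values. It is not the PR's code: `ClusteringSketch`, `Tuple`, `hierarchical`, `tiled`, the 2x2 tile size, and the group size of 4 are all made up for illustration, and the tile-based ordering is just one simple stand-in for a multi-dimensional clustering scheme.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ClusteringSketch {
  // Hypothetical stand-in for the partition values (a, b) of a manifest entry.
  static final class Tuple {
    final int a;
    final int b;

    Tuple(int a, int b) {
      this.a = a;
      this.b = b;
    }

    @Override
    public String toString() {
      return "(" + a + "," + b + ")";
    }
  }

  // All (a, b) pairs with 1 <= a, b <= n.
  static List<Tuple> grid(int n) {
    List<Tuple> tuples = new ArrayList<>();
    for (int a = 1; a <= n; a++) {
      for (int b = 1; b <= n; b++) {
        tuples.add(new Tuple(a, b));
      }
    }
    return tuples;
  }

  // Hierarchical sort: order by a, then by b within each a, then cut into groups.
  static List<List<Tuple>> hierarchical(List<Tuple> tuples, int groupSize) {
    List<Tuple> sorted = new ArrayList<>(tuples);
    sorted.sort(Comparator.comparingInt((Tuple t) -> t.a).thenComparingInt(t -> t.b));
    return chunk(sorted, groupSize);
  }

  // Tile-based grouping: order by the tile that (a, b) falls in, so that each
  // group covers a small range of BOTH columns. Values are assumed to start at 1.
  static List<List<Tuple>> tiled(List<Tuple> tuples, int tile, int groupSize) {
    List<Tuple> sorted = new ArrayList<>(tuples);
    sorted.sort(
        Comparator.comparingInt((Tuple t) -> (t.a - 1) / tile)
            .thenComparingInt(t -> (t.b - 1) / tile)
            .thenComparingInt(t -> t.a)
            .thenComparingInt(t -> t.b));
    return chunk(sorted, groupSize);
  }

  // Split an ordered list into consecutive groups of at most `size` elements.
  static List<List<Tuple>> chunk(List<Tuple> sorted, int size) {
    List<List<Tuple>> groups = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i += size) {
      groups.add(new ArrayList<>(sorted.subList(i, Math.min(i + size, sorted.size()))));
    }
    return groups;
  }

  // Render each group in the same {(a,b)(a,b)...} notation used in this comment.
  static List<String> render(List<List<Tuple>> groups) {
    List<String> lines = new ArrayList<>();
    for (List<Tuple> group : groups) {
      StringBuilder sb = new StringBuilder("{");
      for (Tuple t : group) {
        sb.append(t);
      }
      lines.add(sb.append("}").toString());
    }
    return lines;
  }

  public static void main(String[] args) {
    List<Tuple> tuples = grid(4);
    System.out.println("tiled:        " + render(tiled(tuples, 2, 4)));
    System.out.println("hierarchical: " + render(hierarchical(tuples, 4)));
  }
}
```

With the hierarchical ordering, a filter on b alone touches every group, while the tiled ordering lets a filter on either a or b skip half of them. That is the distinction I have in mind when I say what we do here is a sort rather than a clustering.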
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

