aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019738831


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,75 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds a grouping key type considering all provided specs.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>A grouping key defines how data is split between files and consists of partition fields with
+   * non-void transforms that are present in each provided spec. Iceberg guarantees that records
+   * with different values for the grouping key are disjoint and are stored in separate files.
+   *
+   * <p>If there is only one spec, the grouping key will include all partition fields with non-void
+   * transforms from that spec. Whenever there are multiple specs, the grouping key will represent
+   * an intersection of all partition fields with non-void transforms. If a partition field is
+   * present only in a subset of specs, Iceberg cannot guarantee data distribution on that field.
+   * That's why it will not be part of the grouping key. Unpartitioned tables or tables with
+   * non-overlapping specs have empty grouping keys.
+   *
+   * <p>When partition fields are dropped in v1 tables, they are replaced with new partition fields
+   * that have the same field ID but use a void transform under the hood. Such fields cannot be part
+   * of the grouping key as void transforms always return null.
+   *
+   * @param specs one or many specs
+   * @return the constructed grouping key type
+   */
+  public static StructType groupingKeyType(Collection<PartitionSpec> specs) {

Review Comment:
   @sunchao, we should be able to check whether this type is empty to decide if we can report a distribution to Spark.
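
   To make the intersection rule from the Javadoc concrete, here is a rough, dependency-free sketch. It is not the code in this PR: a spec is modeled as a plain map from partition field ID to transform name, fields are matched by ID only, and `GroupingKeySketch`, `groupingKeyFieldIds`, and `canReportDistribution` are hypothetical names used purely for illustration.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative model only; these are NOT Iceberg classes or APIs.
public class GroupingKeySketch {

  // Assumption: a spec is a map of partition field ID -> transform name,
  // and a dropped v1 field keeps its ID with the transform name "void".
  static Set<Integer> groupingKeyFieldIds(Collection<Map<Integer, String>> specs) {
    Set<Integer> common = null;
    for (Map<Integer, String> spec : specs) {
      // Keep only the fields of this spec that have a non-void transform.
      Set<Integer> nonVoid = new TreeSet<>();
      for (Map.Entry<Integer, String> field : spec.entrySet()) {
        if (!"void".equals(field.getValue())) {
          nonVoid.add(field.getKey());
        }
      }
      // Intersect with the fields that survived all earlier specs.
      if (common == null) {
        common = nonVoid;
      } else {
        common.retainAll(nonVoid);
      }
    }
    return common == null ? new TreeSet<>() : common;
  }

  // The check suggested in the comment: an empty key means no guarantee to report.
  static boolean canReportDistribution(Collection<Map<Integer, String>> specs) {
    return !groupingKeyFieldIds(specs).isEmpty();
  }

  public static void main(String[] args) {
    // v1 table where field 1001 was dropped and replaced by a void transform.
    List<Map<Integer, String>> evolved = List.of(
        Map.of(1000, "identity", 1001, "bucket[8]"),
        Map.of(1000, "identity", 1001, "void"));
    System.out.println(groupingKeyFieldIds(evolved));   // prints [1000]
    System.out.println(canReportDistribution(evolved)); // prints true

    // Specs that share no field: the grouping key is empty.
    List<Map<Integer, String>> disjoint = List.of(
        Map.of(1000, "identity"),
        Map.of(1001, "identity"));
    System.out.println(groupingKeyFieldIds(disjoint));   // prints []
    System.out.println(canReportDistribution(disjoint)); // prints false
  }
}
```

   In the real code the check would presumably be made against the fields of the `StructType` returned by `groupingKeyType`, but the decision is the same: an empty key means there is nothing safe to report.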



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

