zachdisc commented on code in PR #9731:
URL: https://github.com/apache/iceberg/pull/9731#discussion_r1512056069
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java ##########

```java
@@ -628,4 +613,50 @@ private Table table() {
      return tableBroadcast.value();
    }
  }

  static class PartitionFieldSortFunction implements SerializableFunction<DataFile, String> {

    private final PartitionSpec spec;
    private final List<String> partitionFieldSortOrder;

    public PartitionFieldSortFunction(PartitionSpec spec, List<String> partitionFieldSortOrder) {
      this.spec = spec;
      this.partitionFieldSortOrder = partitionFieldSortOrder;
    }

    @Override
    public String apply(DataFile dataFile) {
      Map<String, Integer> fieldIndices = Maps.newHashMap();
      int idx = 0;
      for (PartitionField field : spec.fields()) {
        fieldIndices.put(field.name(), idx);
        idx++;
      }

      StringBuilder sb = new StringBuilder();
      for (String fieldName : partitionFieldSortOrder) {
        Preconditions.checkArgument(
            fieldIndices.containsKey(fieldName),
            "Cannot find partition field with name %s in spec %s",
            fieldName,
            spec);
        int fieldIndex = fieldIndices.get(fieldName);
        PartitionField field = spec.fields().get(fieldIndex);
        Type fieldType = spec.partitionType().field(field.fieldId()).type();
        Class<?> javaClass = spec.javaClasses()[fieldIndex];
        String fieldValue =
            field
                .transform()
                .toHumanString(fieldType, get(dataFile.partition(), fieldIndex, javaClass));
        sb.append(fieldValue);
```

Review Comment:
I actually had something very much like this ready to go earlier! The issue with this approach (which was non-obvious to me at first) was that `repartitionByRange` on a set of strings doesn't know about relative priority/hierarchy. As you point out, this isn't really "sorting", but "clustering". From R3:

> Changed the clustering column to be a struct referencing the actual columns instead of a concatenated string `p1::p2::p3`.
> Thanks to @ Nandhakumar Saravana for pointing this out: when clustering by a string `a::b::c`, Spark doesn't know about relative priority/hierarchy. AKA `a::b::c` should be close to `a::b::d`, which it is when clustering by the actual columns and not a String literal.

What happens when you use a single String like this is that you don't get good clustering on the keys in the order that you specify, unless you happen to have more than one manifest file's worth of manifests to write. In practice, this means that a table initially defined as `PARTITIONED BY (a, b, c)` will have much better metadata organization than one defined as `PARTITIONED BY (c, b, a)` with `RewriteManifests.sort(a, b, c)`. That is the behavior I'm shooting for here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
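
The hierarchy problem above can be seen without Spark at all. Below is a standalone sketch (not from the PR; the `::` separator and the example values are made up): once human-readable partition values are concatenated into one String, every field compares lexicographically, so a numeric field such as a bucket number loses its natural order. Comparing field by field, which is what a struct clustering column lets Spark's range partitioner do, keeps each field's own ordering.

```java
// Standalone sketch (illustrative values, "::" separator assumed).
// Concatenating partition values into one String key makes all fields
// compare lexicographically, inverting numeric order.
public class ConcatKeyDemo {
  public static void main(String[] args) {
    // Partition tuples (bucket, day): (2, 2024-01-01) vs (10, 2024-01-01).
    String k1 = "2::2024-01-01";
    String k2 = "10::2024-01-01";

    // Lexicographically '2' > '1', so bucket 2 sorts AFTER bucket 10:
    System.out.println("string key order:  " + Integer.signum(k1.compareTo(k2)));      // prints 1

    // Field-wise comparison (what a struct column gives the range
    // partitioner) keeps the intended numeric order per field:
    System.out.println("field-wise order: " + Integer.signum(Integer.compare(2, 10))); // prints -1
  }
}
```

With a struct column, ties on the first field fall through to the second field's comparator, so `a::b::c` and `a::b::d` still land next to each other, but numeric and date fields are no longer at the mercy of string comparison.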