zachdisc commented on code in PR #9731:
URL: https://github.com/apache/iceberg/pull/9731#discussion_r1512056069
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java ##########

```java
@@ -628,4 +613,50 @@ private Table table() {
      return tableBroadcast.value();
    }
  }

  static class PartitionFieldSortFunction implements SerializableFunction<DataFile, String> {

    private final PartitionSpec spec;
    private final List<String> partitionFieldSortOrder;

    public PartitionFieldSortFunction(PartitionSpec spec, List<String> partitionFieldSortOrder) {
      this.spec = spec;
      this.partitionFieldSortOrder = partitionFieldSortOrder;
    }

    @Override
    public String apply(DataFile dataFile) {
      Map<String, Integer> fieldIndices = Maps.newHashMap();
      int idx = 0;
      for (PartitionField field : spec.fields()) {
        fieldIndices.put(field.name(), idx);
        idx++;
      }

      StringBuilder sb = new StringBuilder();
      for (String fieldName : partitionFieldSortOrder) {
        Preconditions.checkArgument(
            fieldIndices.containsKey(fieldName),
            "Cannot find partition field with name %s in spec %s",
            fieldName,
            spec);
        int fieldIndex = fieldIndices.get(fieldName);
        PartitionField field = spec.fields().get(fieldIndex);
        Type fieldType = spec.partitionType().field(field.fieldId()).type();
        Class<?> javaClass = spec.javaClasses()[fieldIndex];
        String fieldValue =
            field
                .transform()
                .toHumanString(fieldType, get(dataFile.partition(), fieldIndex, javaClass));
        sb.append(fieldValue);
```

Review Comment:
I actually had something very much like this ready to go earlier! The issue with this approach (which was non-obvious to me at first) was that `repartitionByRange` on a set of strings doesn't know about relative priority/hierarchy. As you point out, this isn't really "sorting", but "clustering". From R3:

> Changed the clustering column to be a struct referencing the actual columns instead of a concatenated string `p1::p2::p3`.
> Thanks to @ Nandhakumar Saravana for pointing this out: when clustering by a string `a::b::c`, Spark doesn't know about relative priority/hierarchy. AKA `a::b::c` should be close to `a::b::d`, which it is when clustering by the actual columns and not a String literal.

What happens when you use a single String like this is that you don't get good clustering on the keys in the order that you specify, unless you happen to have more than one manifest file's worth of manifests to write. In practice, this means that a table initially defined as `PARTITIONED BY (a, b, c)` will have much better metadata organization than one defined as `PARTITIONED BY (c, b, a)` with `RewriteManifests.sort(a, b, c)`. That is the behavior I'm shooting for here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
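
The hierarchy problem above can be seen without Spark at all. Below is a standalone sketch (not from the PR; the `::` separator and the example values are made up): once human-readable partition values are concatenated into one String, every field compares lexicographically, so a numeric field such as a bucket number loses its natural order. Comparing field by field, which is what a struct clustering column lets Spark's range partitioner do, keeps each field's own ordering.

```java
// Standalone sketch (illustrative values, "::" separator assumed).
// Concatenating partition values into one String key makes all fields
// compare lexicographically, inverting numeric order.
public class ConcatKeyDemo {
  public static void main(String[] args) {
    // Partition tuples (bucket, day): (2, 2024-01-01) vs (10, 2024-01-01).
    String k1 = "2::2024-01-01";
    String k2 = "10::2024-01-01";

    // Lexicographically '2' > '1', so bucket 2 sorts AFTER bucket 10:
    System.out.println("string key order:  " + Integer.signum(k1.compareTo(k2)));      // prints 1

    // Field-wise comparison (what a struct column gives the range
    // partitioner) keeps the intended numeric order per field:
    System.out.println("field-wise order: " + Integer.signum(Integer.compare(2, 10))); // prints -1
  }
}
```

With a struct column, ties on the first field fall through to the second field's comparator, so `a::b::c` and `a::b::d` still land next to each other, but numeric and date fields are no longer at the mercy of string comparison.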