jackye1995 commented on code in PR #9731: URL: https://github.com/apache/iceberg/pull/9731#discussion_r1498195593
##########
api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java:
##########
@@ -44,6 +45,16 @@ public interface RewriteManifests
    */
   RewriteManifests rewriteIf(Predicate<ManifestFile> predicate);

+  /**
+   * Rewrite manifests in a given order, based on partition columns
+   *
+   * <p>If not set, manifests will be rewritten in the order of the table's spec.
+   *
+   * @param partitionSortOrder a list of partition field names

Review Comment:

I see the question in the description about

> I would next like to have a more capable functional interface sort(row -> {... return groupingString}), but was struggling to express a Java Function-like input that could be used on a DataSet<Row>'s data_file struct - any pointers on how to parse a Row's struct into a native Pojo and supply a function that can be used like in a UDF here would be appreciated!

I think it aligns with my main comment: we discussed the potential to expose this as a generic function, but this PR seems to have skipped it. There are some pieces missing, so let me see if I can fill them in.

At the API level, we could express it as a function using `PartitionData` as the transformation source:

```
default RewriteManifests sort(List<String> partitionFieldNames) {
  // see partitionSpec.partitionToPath for an example of how to convert the partition fields to string
  return sort(...)
}

default RewriteManifests sort(Function<PartitionData, String> partitionFieldsSortStrategy) {
  throw new UnsupportedOperationException();
}
```

When implementing this in Spark, the biggest question is how to convert the Row `data_file.partition` to `PartitionData`, so that we can apply this transform and create the new string column in the data frame that is used for sorting.

I think this is achievable by creating a UDF that transforms the `data_file.partition` Row to `SparkStructLike` using `SparkStructLike.wrap(row)`, then creates a new `PartitionData` from it, and finally applies the input function to produce the string value. You can see an example of that in `PartitionsTable.toPartitionData`.

@RussellSpitzer @nastra let me know if you agree with this general direction, or if it is becoming too convoluted.
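
To make the shape of the proposal concrete, here is a minimal, self-contained sketch of the "list of field names delegates to a function" idea. It does not use Iceberg or Spark: `PartitionStandIn` is a hypothetical stand-in for `PartitionData`, and `byFields` and `sortedKeys` are illustrative names that exist in neither project. The string key only imitates the `name=value/name=value` style of a partition path, and it skips the escaping a real implementation would need.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ManifestSortSketch {

  /** Hypothetical stand-in for PartitionData: partition field name -> value. */
  static final class PartitionStandIn {
    private final Map<String, Object> values;

    PartitionStandIn(Map<String, Object> values) {
      this.values = new LinkedHashMap<>(values);
    }

    Object get(String fieldName) {
      return values.get(fieldName);
    }
  }

  /**
   * Plays the role of sort(List<String>): turns field names into a
   * function that builds a path-like string key from the named fields,
   * in the order given.
   */
  static Function<PartitionStandIn, String> byFields(List<String> partitionFieldNames) {
    return partition ->
        partitionFieldNames.stream()
            .map(name -> name + "=" + partition.get(name))
            .collect(Collectors.joining("/"));
  }

  /**
   * Plays the role of the derived string column: applies the strategy to
   * every partition and returns the keys in lexicographic order.
   */
  static List<String> sortedKeys(
      List<PartitionStandIn> partitions, Function<PartitionStandIn, String> strategy) {
    return partitions.stream().map(strategy).sorted().collect(Collectors.toList());
  }
}
```

With this shape, a caller could pass `ManifestSortSketch.byFields(List.of("region", "day"))` or any custom lambda as the strategy. In the Spark implementation, the equivalent of `sortedKeys` would presumably be a UDF-generated column that the data frame is sorted by.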