jackye1995 commented on code in PR #9731:
URL: https://github.com/apache/iceberg/pull/9731#discussion_r1498195593
##########
api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java:
##########
@@ -44,6 +45,16 @@ public interface RewriteManifests
*/
RewriteManifests rewriteIf(Predicate<ManifestFile> predicate);
+ /**
+ * Rewrite manifests in a given order, based on partition columns
+ *
+ * <p>If not set, manifests will be rewritten in the order of the table's spec.
+ *
+ * @param partitionSortOrder a list of partition field names
Review Comment:
I see this question in the PR description:
> I would next like to have a more capable functional interface sort(row ->
> {... return groupingString}), but was struggling to express a Java Function-like
> input that could be used on a DataSet<Row>'s data_file struct - any pointers on
> how to parse a Row's struct into a native Pojo and supply a function that can
> be used like in a UDF here would be appreciated!
I think it aligns with my main comment: we discussed the potential to expose
this as a generic function, but this PR seems to have skipped it. Some pieces
seem to be missing, so let me try to fill them in.
At the API level, we could express it as a function that uses `PartitionData`
as the transformation source:
```
default RewriteManifests sort(List<String> partitionFieldNames) {
  // see partitionSpec.partitionToPath for an example of how to convert the
  // partition fields to a string
  return sort(...);
}

default RewriteManifests sort(Function<PartitionData, String> partitionFieldsSortStrategy) {
  throw new UnsupportedOperationException();
}
```
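To make the delegation concrete, here is a minimal, dependency-free sketch of how the field-name overload could build its key function. It is an illustration only: `ManifestSorter`, `fieldsToKey`, and the `Map<String, Object>` standing in for `PartitionData` are hypothetical names, not Iceberg API, and unlike `partitionToPath` this version does no escaping of values.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class SortKeySketch {

  // Hypothetical stand-in for the action interface. Partition values are
  // modeled as a Map<String, Object> instead of Iceberg's PartitionData.
  interface ManifestSorter {
    // General form: the caller supplies the partition -> sort key function.
    ManifestSorter sort(Function<Map<String, Object>, String> keyFn);

    // Convenience form: delegate to the general form with a path-like key
    // built from the named fields, in the order given.
    default ManifestSorter sort(List<String> partitionFieldNames) {
      return sort(fieldsToKey(partitionFieldNames));
    }
  }

  // Builds "name=value/name=value" from the selected fields (no escaping).
  static Function<Map<String, Object>, String> fieldsToKey(List<String> fieldNames) {
    return partition ->
        fieldNames.stream()
            .map(name -> name + "=" + partition.get(name))
            .collect(Collectors.joining("/"));
  }

  public static void main(String[] args) {
    Map<String, Object> partition = new LinkedHashMap<>();
    partition.put("event_date", "2024-02-21");
    partition.put("bucket", 7);
    // prints "bucket=7/event_date=2024-02-21"
    System.out.println(fieldsToKey(List.of("bucket", "event_date")).apply(partition));
  }
}
```

The point of the shape is that the list-based method stays a thin wrapper, so an engine only has to implement the function-based one.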
When implementing this in Spark, the biggest question is how to convert the
`data_file.partition` Row to `PartitionData`, so that we can apply this
transform and create the new string column in the DataFrame that is used for
sorting.
I think this is achievable by creating a UDF that transforms the
`data_file.partition` Row to `SparkStructLike` using
`SparkStructLike.wrap(row)`, then creates a new `PartitionData` from it, and
finally applies the input function to produce the string value. You can
see an example of that in `PartitionsTable.toPartitionData`.
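The wrap, apply, then sort flow can be sketched in plain Java collections, without Spark. Everything here is a stand-in I am assuming for illustration: `Struct` plays the role of the wrapped Row, `Entry` the role of a manifest entry row, and `sortByKey` the role of the UDF-derived string column plus the sort on it. None of these names are Iceberg or Spark API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

class PartitionKeySortSketch {

  // Hypothetical stand-in for a positional struct such as a wrapped Spark Row.
  interface Struct {
    Object get(int pos);
  }

  // Hypothetical stand-in for a manifest entry: file path plus partition tuple.
  static final class Entry {
    final String filePath;
    final Object[] partition;

    Entry(String filePath, Object... partition) {
      this.filePath = filePath;
      this.partition = partition;
    }
  }

  // Mirrors the "wrap" step: expose the raw tuple through the struct interface.
  static Struct wrap(Object[] values) {
    return pos -> values[pos];
  }

  // Mirrors the UDF plus sort: derive a string key per entry, then order by it.
  // The input list is left untouched; a sorted copy is returned.
  static List<Entry> sortByKey(List<Entry> entries, Function<Struct, String> keyFn) {
    List<Entry> sorted = new ArrayList<>(entries);
    sorted.sort(Comparator.comparing((Entry e) -> keyFn.apply(wrap(e.partition))));
    return sorted;
  }

  public static void main(String[] args) {
    List<Entry> entries = List.of(
        new Entry("a.parquet", "2024-02-02", 1),
        new Entry("b.parquet", "2024-02-01", 2),
        new Entry("c.parquet", "2024-02-01", 1));
    // User-supplied strategy: order by date first, then by bucket.
    Function<Struct, String> keyFn = s -> s.get(0) + "/" + s.get(1);
    for (Entry e : sortByKey(entries, keyFn)) {
      System.out.println(e.filePath); // prints c.parquet, b.parquet, a.parquet
    }
  }
}
```

Note that the key is compared as a string, so numeric fields sort lexicographically unless the supplied function pads them.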
@RussellSpitzer @nastra let me know if you agree with this general
direction, or if it is becoming too convoluted.