Re: [PR] Spark: Adding simple custom partition sort order option to RewriteManifests Spark Action [iceberg]

via GitHub Wed, 21 Feb 2024 11:38:25 -0800


jackye1995 commented on code in PR #9731:
URL: https://github.com/apache/iceberg/pull/9731#discussion_r1498195593



##########
api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java:
##########
@@ -44,6 +45,16 @@ public interface RewriteManifests
    */
   RewriteManifests rewriteIf(Predicate<ManifestFile> predicate);
 
+  /**
+   * Rewrite manifests in a given order, based on partition columns
+   *
+   * <p>If not set, manifests will be rewritten in the order of the table's spec.
+   *
+   * @param partitionSortOrder a list of partition field names

Review Comment:
   I see the question in the description:
   
   > I would next like to have a more capable functional interface sort(row -> {... return groupingString}), but was struggling to express a Java Function-like input that could be used on a DataSet<Row>'s data_file struct - any pointers on how to parse a Row's struct into a native Pojo and supply a function that can be used like in a UDF here would be appreciated!
   
   I think it aligns with my main comment: we discussed the potential to expose this as a generic function, but this PR seems to have skipped that step. Some pieces seem to be missing, so let me see if I can fill them in.
   
   At the API level, we could express it as a function that uses `PartitionData` as the transformation source:
   
   ```
   default RewriteManifests sort(List<String> partitionFieldNames) {
     // see partitionSpec.partitionToPath for an example of how to convert
     // the partition fields to string
     return sort(...);
   }
   
   default RewriteManifests sort(Function<PartitionData, String> partitionFieldsSortStrategy) {
     throw new UnsupportedOperationException();
   }
   ```
   
   When implementing this in Spark, the biggest question is how to convert the `data_file.partition` Row to `PartitionData`, so that we can apply this transform and create the new string column in the data frame that is used for sorting.
   
   I think this is achievable by creating a UDF that converts the `data_file.partition` Row to `SparkStructLike` using `SparkStructLike.wrap(row)`, creates a new `PartitionData` from it, and finally applies the input function to produce the string value. You can see an example of that in `PartitionsTable.toPartitionData`.
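
To make the shape of this concrete, here is a rough, JDK-only sketch of the "derive a string key per partition, then sort on it" idea. It deliberately does not use Iceberg or Spark: `PartitionValues` is a hypothetical stand-in for `PartitionData`, `keyFor` and `sort` are illustrative names, and the `name=value/name=value` key format is only loosely modeled on what `partitionToPath` produces. In the real action the key function would run inside the UDF described above rather than over an in-memory list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for PartitionData: partition field name -> value.
final class PartitionValues {
  private final Map<String, Object> values;

  PartitionValues(Map<String, ?> values) {
    this.values = new HashMap<String, Object>(values);
  }

  Object get(String fieldName) {
    return values.get(fieldName);
  }
}

final class PartitionSortSketch {
  private PartitionSortSketch() {}

  // Assumed key format: "field=value" segments joined by "/", in the requested field order.
  static Function<PartitionValues, String> keyFor(List<String> partitionFieldNames) {
    return partition ->
        partitionFieldNames.stream()
            .map(name -> name + "=" + partition.get(name))
            .collect(Collectors.joining("/"));
  }

  // Returns a copy of the input ordered by the string key that the strategy derives.
  static List<PartitionValues> sort(
      List<PartitionValues> partitions, Function<PartitionValues, String> strategy) {
    List<PartitionValues> sorted = new ArrayList<>(partitions);
    sorted.sort(Comparator.comparing(strategy));
    return sorted;
  }

  public static void main(String[] args) {
    List<PartitionValues> partitions =
        List.of(
            new PartitionValues(Map.of("region", "eu", "day", "2024-02-21")),
            new PartitionValues(Map.of("region", "us", "day", "2024-02-20")));

    // The list-of-names overload reduces to the function-based one.
    Function<PartitionValues, String> byDayThenRegion = keyFor(List.of("day", "region"));
    for (PartitionValues partition : sort(partitions, byDayThenRegion)) {
      System.out.println(byDayThenRegion.apply(partition));
    }
  }
}
```

The point of the sketch is only that the `List<String>` variant can be expressed as one particular `Function` passed to the generic variant, so the Spark side would need to support just the function-based path.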
   
   @RussellSpitzer @nastra let me know if you agree with this general 
direction, or if it is becoming too convoluted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


