huaxingao commented on code in PR #15310:
URL: https://github.com/apache/iceberg/pull/15310#discussion_r2824680373


##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java:
##########
@@ -329,7 +330,9 @@ private Column sortColumn() {
   }
 
   private Dataset<Row> repartitionAndSort(Dataset<Row> df, Column col, int numPartitions) {
-    return df.repartitionByRange(numPartitions, col).sortWithinPartitions(col);
+    // add file path for range partition to make sure we have enough parallelism

Review Comment:
   This makes sense to me. `repartitionByRange(data_file.partition)` can lose parallelism when there are only a few distinct partition values (for example, when everything falls into a single partition). I don't think AQE fixes this for a range repartition: Spark mainly splits large/skewed shuffle partitions for joins or for `rebalance(...)` / `RebalancePartitions`, not for `repartitionByRange`. Adding `data_file.file_path` as a second range column looks like a good, deterministic way to get more parallel work while still keeping the partition as the first (primary) clustering key.
   
    Small perf note: `file_path` can be a long string, so using it as a range key may add extra compare/sort cost. Would a cheaper salt such as `xxhash64(file_path)` work instead?
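    To make the salting idea concrete, here is a minimal, Spark-free sketch in plain Java (class and method names are hypothetical, and a simple polynomial hash stands in for Spark's `org.apache.spark.sql.functions.xxhash64`). It illustrates why hashing `file_path` restores parallelism when every row shares one partition value; modulo bucketing stands in for the range boundaries a range partitioner would derive from a sample.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SaltSketch {
  // Stand-in for xxhash64(file_path): any well-distributed 64-bit hash
  // works for illustration; the real Spark call would be
  // org.apache.spark.sql.functions.xxhash64(df.col("data_file.file_path")).
  static long hash(String s) {
    long h = 1125899906842597L; // arbitrary odd seed
    for (int i = 0; i < s.length(); i++) {
      h = 31 * h + s.charAt(i);
    }
    return h;
  }

  // Bucket rows by hash(file_path) modulo the target partition count;
  // this approximates how a salt column spreads rows across tasks.
  static Map<Integer, Long> bucketCounts(List<String> filePaths, int numPartitions) {
    return filePaths.stream()
        .collect(Collectors.groupingBy(
            p -> (int) Long.remainderUnsigned(hash(p), numPartitions),
            Collectors.counting()));
  }

  public static void main(String[] args) {
    // 1000 files that all share a single partition value: ranging on
    // data_file.partition alone would collapse every row into one task.
    List<String> files = IntStream.range(0, 1000)
        .mapToObj(i -> "s3://bucket/table/part=0/file-" + i + ".parquet")
        .collect(Collectors.toList());

    Map<Integer, Long> counts = bucketCounts(files, 8);
    System.out.println("non-empty buckets: " + counts.size());
  }
}
```

    With only `data_file.partition` as the range key, all 1000 rows land in one shuffle partition; adding the hashed file path as a secondary key spreads them across multiple buckets while the partition value still drives the primary ordering.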



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
