amogh-jahagirdar commented on code in PR #15310:
URL: https://github.com/apache/iceberg/pull/15310#discussion_r2819660938


##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java:
##########
@@ -329,7 +330,9 @@ private Column sortColumn() {
   }
 
   private Dataset<Row> repartitionAndSort(Dataset<Row> df, Column col, int 
numPartitions) {
-    return df.repartitionByRange(numPartitions, col).sortWithinPartitions(col);
+    // add file path for range partition to make sure we have enough 
parallelism

Review Comment:
   I can see how the current implementation doesn't leverage full parallelism 
when there is very low cardinality on partition values. My understanding of 
[Range 
partitioning](https://github.com/apache/spark/blame/branch-4.1/core/src/main/scala/org/apache/spark/Partitioner.scala#L235)
 is that it primarily looks at  trying to identify partitions with lots of 
values and tries to balance partition sizes, but in case of low cardinality 
values to begin with it doesn't add to the parallelism (that 
math.min(partitions, candidates.size) looks like it'd be bounded by the the 
cardinality of iceberg partition values since the number of candidates would be 
small.
   
   I don't think AQE would also help here since my understanding is that it'll 
help coalesce too many small partitions, help with splitting partitions during 
a join etc. Don't think it would help in this case.
   
   I think the proposed solution for by adding a higher cardinality column like 
file path to the repartition by range is a good way of adding parallelism here, 
and I can't imagine it regressing any existing cases (since the 
repartitionByRange already does balancing within partitions) though someone 
knowledgable here should chime in in case there's more implications. cc 
@aokolnychyi @RussellSpitzer @rdblue  @singhpk234 @huaxingao 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to