amogh-jahagirdar commented on code in PR #15310:
URL: https://github.com/apache/iceberg/pull/15310#discussion_r2819660938
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java:
##########
@@ -329,7 +330,9 @@ private Column sortColumn() {
}
private Dataset<Row> repartitionAndSort(Dataset<Row> df, Column col, int
numPartitions) {
- return df.repartitionByRange(numPartitions, col).sortWithinPartitions(col);
+ // add file path for range partition to make sure we have enough
parallelism
Review Comment:
I can see how the current implementation doesn't leverage full parallelism
when there is very low cardinality on partition values. My understanding of
[Range
partitioning](https://github.com/apache/spark/blame/branch-4.1/core/src/main/scala/org/apache/spark/Partitioner.scala#L235)
is that it primarily looks at trying to identify partitions with lots of
values and tries to balance partition sizes, but in case of low cardinality
values to begin with it doesn't add to the parallelism (that
math.min(partitions, candidates.size) looks like it'd be bounded by the the
cardinality of iceberg partition values since the number of candidates would be
small.
I don't think AQE would also help here since my understanding is that it'll
help coalesce too many small partitions, help with splitting partitions during
a join etc. Don't think it would help in this case.
I think the proposed solution for by adding a higher cardinality column like
file path to the repartition by range is a good way of adding parallelism here,
and I can't imagine it regressing any existing cases (since the
repartitionByRange already does balancing within partitions) though someone
knowledgable here should chime in in case there's more implications. cc
@aokolnychyi @RussellSpitzer @rdblue @singhpk234 @huaxingao
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]