bknbkn commented on code in PR #9447:
URL: https://github.com/apache/iceberg/pull/9447#discussion_r1458505983
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteManifestsProcedure.java:
##########
@@ -118,4 +126,15 @@ private InternalRow[] toOutputRows(RewriteManifests.Result
result) {
public String description() {
return "RewriteManifestsProcedure";
}
+
+ private RewriteManifestsSparkAction checkAndApplyFilter(
+ RewriteManifestsSparkAction action, String where, Identifier ident) {
+ if (where != null) {
+ Expression expression = filterExpression(ident, where);
Review Comment:
Thanks for your review, @aokolnychyi. In our scenario, maintenance actions
mostly run as periodic tasks: we call them automatically on a schedule to
maintain many existing tables in the warehouse, and most of those tables are
time-partitioned. Adding a filter brings the following benefits:
- **Reduce unnecessary manifest IO in the cluster**
Within one action execution interval, only some of each table's partitions
are updated (about 10% of partitions in our scenario). The manifests covering
the remaining partitions were already aggregated by the previous run and do
not need to be re-aggregated. For a single table, rewriting all manifests
takes no more than **10 minutes** when run distributed, but we would still
like the opportunity to reduce the overall pressure on the cluster.
- **Conflict avoidance**
When an upstream job writes too many small files to the Iceberg table at one
time, it may trigger manifest merging (via ManifestMergeManager.mergeGroup).
This can conflict with the rewrite manifests action. We cannot stop streaming
jobs, so the only remedy is to retry the action. Filtering out the partition
that the job is currently writing to reduces how often such conflicts occur.
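
To make both points concrete, here is a minimal, self-contained sketch of the
selection idea. It is not the code in this PR: `ManifestInfo`,
`selectForRewrite`, the day ordinals, and the file names are all hypothetical
stand-ins, and no Iceberg classes are used. It only illustrates how a
partition-range filter skips already-aggregated manifests and leaves out the
partition that is still being written.

```java
import java.util.ArrayList;
import java.util.List;

public class ManifestFilterSketch {

  // Hypothetical stand-in for a manifest's partition summary: the inclusive
  // range of day ordinals covered by the data files the manifest tracks.
  static final class ManifestInfo {
    final String path;
    final int minDay;
    final int maxDay;

    ManifestInfo(String path, int minDay, int maxDay) {
      this.path = path;
      this.minDay = minDay;
      this.maxDay = maxDay;
    }
  }

  // Keep only manifests whose day range overlaps [fromDay, toDay]; every
  // other manifest is left untouched and is never read or rewritten.
  static List<ManifestInfo> selectForRewrite(
      List<ManifestInfo> manifests, int fromDay, int toDay) {
    List<ManifestInfo> selected = new ArrayList<>();
    for (ManifestInfo m : manifests) {
      boolean overlaps = m.maxDay >= fromDay && m.minDay <= toDay;
      if (overlaps) {
        selected.add(m);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<ManifestInfo> manifests = new ArrayList<>();
    manifests.add(new ManifestInfo("m-old-1.avro", 1, 30));
    manifests.add(new ManifestInfo("m-old-2.avro", 31, 60));
    manifests.add(new ManifestInfo("m-recent.avro", 58, 62));
    manifests.add(new ManifestInfo("m-today.avro", 63, 63));

    // Rewrite only manifests touching days 61..62. Day 63 is assumed to be
    // the partition a streaming job is still writing, so it is excluded.
    for (ManifestInfo m : selectForRewrite(manifests, 61, 62)) {
      System.out.println(m.path);
    }
  }
}
```

In this toy setup only `m-recent.avro` is selected: the two older manifests
are skipped, and the manifest for the actively written day stays out of the
rewrite.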
**By the way,** for a V1 table with snapshot-id inherit=true, the action still
needs to write all manifests in the driver when it executes the copyManifests
function. I found that this property no longer affects V2 tables. Is there any
plan to remove this property or default it to true for V1 tables, or is it
still unsafe in some way?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]