bknbkn commented on code in PR #9447:
URL: https://github.com/apache/iceberg/pull/9447#discussion_r1458505983


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteManifestsProcedure.java:
##########
@@ -118,4 +126,15 @@ private InternalRow[] toOutputRows(RewriteManifests.Result 
result) {
   public String description() {
     return "RewriteManifestsProcedure";
   }
+
+  private RewriteManifestsSparkAction checkAndApplyFilter(
+      RewriteManifestsSparkAction action, String where, Identifier ident) {
+    if (where != null) {
+      Expression expression = filterExpression(ident, where);

Review Comment:
   Thanks for your review @aokolnychyi , In our scenario, `Maintenance Action` 
more often exists as a periodic task, we will automatically call `Maintenance 
Action` regularly to maintain many existing tables in the warehouse, most of 
tables are time-partitioned tables. Adding filter brings the following benefits:
   
   - **Reduce unnecessary manifests IO in the cluster**
   
   Within an action execution interval, only a part of the partition data of 
each table will be updated (in our scenario about 10% of partitions will be 
affected, the manifests contained remaining partitions have been aggregated 
when the previous action was executed, and no need to re-aggregate). For a 
single table, rewrite all manifests  will not take more than **10 minutes**  in 
a distributed situation, but we still hope to have the opportunity to reduce 
the overall pressure on the cluster.
   
   - **Conflict avoidance**
   
   When too many small files are written to the upstream of the Iceberg table 
at one time, it may trigger manfests merging(By 
ManifestMergeManager.mergeGroup). This behavior may conflict with the rewrite 
manifests action. Facing streaming tasks, we cannot stop it, can only be solved 
by retrying the action. But filter out the partition the task is currently 
writing can reduce the occurrence of conflicts.
   
   **By the way,** In V1 table with snapshot-id inherit=true, it still need 
write all manifests in driver when execute copyManifests function, I find that 
this propertity is no longer affects V2 table, Is any plan to remove or default 
true this propertity in the V1 table, or is there still any unsafe?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to