amogh-jahagirdar commented on code in PR #11131:
URL: https://github.com/apache/iceberg/pull/11131#discussion_r1774350481


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -81,6 +81,7 @@ public String partition() {
 
   // cache filtered manifests to avoid extra work when commits fail.
   private final Map<ManifestFile, ManifestFile> filteredManifests = 
Maps.newConcurrentMap();
+  private final Set<String> manifestsWithDeletedFiles = 
Sets.newConcurrentHashSet();

Review Comment:
   I discussed offline and stepped through `canContainDeletedFiles` with 
@aokolnychyi, one branch in `canContainDeletedFiles` that leads to false 
positives in the current PR is 
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L352.
 
   
   Let's say the majority of manifests don't need to be rewritten and the table 
is unpartitioned we'd return true 
[here](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/util/ManifestFileUtil.java#L109).
 
   
   The same principle applies if the table is partitioned, it still depends on 
how the deletes are grouped and how the manifests are grouped.
   
   To solve against false positives, locally I added a flag which explicitly 
tracks if all the deletes being performed are on files which have referenced 
manifests. If that is the case, we know we can use the set as a source of 
truth; if the referenced location is not in the set then we know with certainty 
the manifest doesn't have to be rewritten.
   
   After changing the implementation, I'm seeing the benchmarks improve a good 
chunk (~30%).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to