amogh-jahagirdar commented on code in PR #11131: URL: https://github.com/apache/iceberg/pull/11131#discussion_r1774350481
########## core/src/main/java/org/apache/iceberg/ManifestFilterManager.java: ########## @@ -81,6 +81,7 @@ public String partition() { // cache filtered manifests to avoid extra work when commits fail. private final Map<ManifestFile, ManifestFile> filteredManifests = Maps.newConcurrentMap(); + private final Set<String> manifestsWithDeletedFiles = Sets.newConcurrentHashSet(); Review Comment: I discussed offline and stepped through `canContainDeletedFiles` with @aokolnychyi, one branch in `canContainDeletedFiles` that leads to false positives in the current PR is https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L352. Let's say the majority of manifests don't need to be rewritten and the table is unpartitioned we'd return true [here](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/util/ManifestFileUtil.java#L109). The same principle applies if the table is partitioned, it still depends on how the deletes are grouped and how the manifests are grouped. To solve against false positives, locally I added a flag which explicitly tracks if all the deletes being performed are on files which have referenced manifests. If that is the case, we know we can use the set as a source of truth; if the referenced location is not in the set then we know with certainty the manifest doesn't have to be rewritten. After changing the implementation, I'm seeing the benchmarks improve a good chunk (~30/40%). ########## core/src/main/java/org/apache/iceberg/ManifestFilterManager.java: ########## @@ -81,6 +81,7 @@ public String partition() { // cache filtered manifests to avoid extra work when commits fail. private final Map<ManifestFile, ManifestFile> filteredManifests = Maps.newConcurrentMap(); + private final Set<String> manifestsWithDeletedFiles = Sets.newConcurrentHashSet(); Review Comment: I discussed offline and stepped through `canContainDeletedFiles` with @aokolnychyi, one branch in `canContainDeletedFiles` that leads to false positives in the current PR is https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L352. Let's say the majority of manifests don't need to be rewritten and the table is unpartitioned we'd return true [here](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/util/ManifestFileUtil.java#L109). The same principle applies if the table is partitioned, it still depends on how the deletes are grouped and how the manifests are grouped. To solve against false positives, locally I added a flag which explicitly tracks if all the deletes being performed are on files which have referenced manifests. If that is the case, we know we can use the set as a source of truth; if the referenced location is not in the set then we know with certainty the manifest doesn't have to be rewritten. After changing the implementation, I'm seeing the benchmarks improve a good chunk (~30-40%). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org