amogh-jahagirdar commented on code in PR #11131: URL: https://github.com/apache/iceberg/pull/11131#discussion_r1815379066
########## core/src/main/java/org/apache/iceberg/ManifestFilterManager.java: ########## @@ -185,6 +200,13 @@ List<ManifestFile> filterManifests(Schema tableSchema, List<ManifestFile> manife return ImmutableList.of(); } + // The current set of referenced manifests can be trusted if it is a subset of the manifests + // being filtered. If a single referenced manifest is not in the set of manifests being filtered + // this indicates that the referenced manifests are stale and cannot be trusted. + Set<String> manifestLocations = Review Comment: >it means that we can filter the set of referenced manifests rather than the set of manifests that was passed into this method. Actually, I think it's more specific than this. It's possible that the set can be trusted, there's a manifest which is not referenced set, *_and_* the manifest still needs to be filtered because it either has live entries OR it's a delete manifest with aged out entries (the min sequence number ends up being less than the min data sequence number during filtering). This also ties in to trying to make this logic a bit less complicated since that last condition I mentioned is what's leading to the current implementation. ########## core/src/main/java/org/apache/iceberg/ManifestFilterManager.java: ########## @@ -185,6 +198,16 @@ List<ManifestFile> filterManifests(Schema tableSchema, List<ManifestFile> manife return ImmutableList.of(); } + // Use the current set of referenced manifests as a source of truth when it's a subset of all + // manifests and all removals which were performed reference manifests. + // If a manifest is not in the trusted referenced set, filtering can be skipped + // assuming that the manifest does not have any live entries or aged out deletes + Set<String> manifestLocations = + manifests.stream().map(ManifestFile::path).collect(Collectors.toSet()); + boolean trustReferencedManifests = + allDeletesReferenceManifests + && manifestLocations.containsAll(manifestsReferencedForDeletes); Review Comment: Let me know if you think the new approach is easier to read or you prefer going back to the more instance state based approach @rdblue . I think keeping it all in a single boolean which includes the fact that all the deletes referenced manifests makes it a little bit better. ########## core/src/main/java/org/apache/iceberg/ManifestFilterManager.java: ########## @@ -185,6 +198,16 @@ List<ManifestFile> filterManifests(Schema tableSchema, List<ManifestFile> manife return ImmutableList.of(); } + // Use the current set of referenced manifests as a source of truth when it's a subset of all + // manifests and all removals which were performed reference manifests. + // If a manifest is not in the trusted referenced set, filtering can be skipped + // assuming that the manifest does not have any live entries or aged out deletes + Set<String> manifestLocations = + manifests.stream().map(ManifestFile::path).collect(Collectors.toSet()); + boolean trustReferencedManifests = + allDeletesReferenceManifests + && manifestLocations.containsAll(manifestsReferencedForDeletes); Review Comment: Instead of making `trustReferencedManifests` an instance variable, I made it an argument that gets plumbed through. I tried upleveling more state to be passed through but the challenge is it ends up being more convoluted since there are cases (e.g. aged out deletes or new manifests with live entries) which still need filtering even though the trusted set does not reference those manifests. Upleveling more requires upleveling even more of the filter logic for these other cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org