amogh-jahagirdar commented on code in PR #11131:
URL: https://github.com/apache/iceberg/pull/11131#discussion_r1815379066
##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -185,6 +200,13 @@ List<ManifestFile> filterManifests(Schema tableSchema,
List<ManifestFile> manife
return ImmutableList.of();
}
+ // The current set of referenced manifests can be trusted if it is a
subset of the manifests
+ // being filtered. If a single referenced manifest is not in the set of
manifests being filtered
+ // this indicates that the referenced manifests are stale and cannot be
trusted.
+ Set<String> manifestLocations =
Review Comment:
>it means that we can filter the set of referenced manifests rather than the
set of manifests that was passed into this method.
Actually, I think it's more specific than this. It's possible that the set
can be trusted, there's a manifest which is not referenced set, *_and_* the
manifest still needs to be filtered because it either has live entries OR it's
a delete manifest with aged out entries (the min sequence number ends up being
less than the min data sequence number during filtering).
This also ties in to trying to make this logic a bit less complicated since
that last condition I mentioned is what's leading to the current implementation.
##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -185,6 +198,16 @@ List<ManifestFile> filterManifests(Schema tableSchema,
List<ManifestFile> manife
return ImmutableList.of();
}
+ // Use the current set of referenced manifests as a source of truth when
it's a subset of all
+ // manifests and all removals which were performed reference manifests.
+ // If a manifest is not in the trusted referenced set, filtering can be
skipped
+ // assuming that the manifest does not have any live entries or aged out
deletes
+ Set<String> manifestLocations =
+ manifests.stream().map(ManifestFile::path).collect(Collectors.toSet());
+ boolean trustReferencedManifests =
+ allDeletesReferenceManifests
+ && manifestLocations.containsAll(manifestsReferencedForDeletes);
Review Comment:
Let me know if you think the new approach is easier to read or you prefer
going back to the more instance state based approach @rdblue . I think keeping
it all in a single boolean which includes the fact that all the deletes
referenced manifests makes it a little bit better.
##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -185,6 +198,16 @@ List<ManifestFile> filterManifests(Schema tableSchema,
List<ManifestFile> manife
return ImmutableList.of();
}
+ // Use the current set of referenced manifests as a source of truth when
it's a subset of all
+ // manifests and all removals which were performed reference manifests.
+ // If a manifest is not in the trusted referenced set, filtering can be
skipped
+ // assuming that the manifest does not have any live entries or aged out
deletes
+ Set<String> manifestLocations =
+ manifests.stream().map(ManifestFile::path).collect(Collectors.toSet());
+ boolean trustReferencedManifests =
+ allDeletesReferenceManifests
+ && manifestLocations.containsAll(manifestsReferencedForDeletes);
Review Comment:
Instead of making `trustReferencedManifests` an instance variable, I made it
an argument that gets plumbed through. I tried upleveling more state to be
passed through but the challenge is it ends up being more convoluted since
there are cases (e.g. aged out deletes or new manifests with live entries)
which still need filtering even though the trusted set does not reference those
manifests. Upleveling more requires upleveling even more of the filter logic
for these other cases.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]