amogh-jahagirdar commented on code in PR #11131:
URL: https://github.com/apache/iceberg/pull/11131#discussion_r1776062427
##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -325,7 +341,15 @@ private ManifestFile filterManifest(Schema tableSchema,
ManifestFile manifest) {
}
}
+ @SuppressWarnings("checkstyle:CyclomaticComplexity")
private boolean canContainDeletedFiles(ManifestFile manifest) {
+ boolean manifestReferencedInDelete =
manifestsWithDeletedFiles.contains(manifest.path());
+ if (manifest.content() == ManifestContent.DELETES &&
allDeletesReferenceManifests) {
Review Comment:
The reason for this condition on the manifest being a delete manifest is
that there is still a remaining case that I'm seeing where we cannot use the
`manifestsWithDeletedFiles` set as an absolute source of truth. It showed up in
[this
test](https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestSnapshotManager.java#L58).
When performing the cherry pick operation with a dynamic overwrite, all the
deleted operations (the deletion of FILE_A) do reference manifests.
However, the manifest that actually gets passed for filtering is the
manifest produced for the appending of file A. Even though it's not in the
manifestsWtihDeletedFiles set, we actually do need to rewrite in this case
because the cherry pick is replacing the file with FILE_A_REPLACEMENT.
In essence, we relied on the deleted file partitions for this check
previously and it seems like we still need to. If we limit for delete file
cases and all the files to delete reference manifests I don't think there would
be a case where if manifest was not in the set we'd still need to rewrite it
since it's not possible to just delete during a cherry pick.
This implementation does avoid the false positive reads for delete manifests
which will be important for sync maintenance as well as also efficiently
identifies for both data/delete manifests which are referenced, and that seems
to bring a benefit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]