amogh-jahagirdar commented on code in PR #11131:
URL: https://github.com/apache/iceberg/pull/11131#discussion_r1815379066


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -185,6 +200,13 @@ List<ManifestFile> filterManifests(Schema tableSchema, 
List<ManifestFile> manife
       return ImmutableList.of();
     }
 
+    // The current set of referenced manifests can be trusted if it is a 
subset of the manifests
+    // being filtered. If a single referenced manifest is not in the set of 
manifests being filtered
+    // this indicates that the referenced manifests are stale and cannot be 
trusted.
+    Set<String> manifestLocations =

Review Comment:
   >it means that we can filter the set of referenced manifests rather than the 
set of manifests that was passed into this method.
   
   Actually, I think it's more specific than this. It's possible that the set 
can be trusted, there's a manifest which is not referenced set, *_and_* the 
manifest still needs to be filtered because it either has live entries OR it's 
a delete manifest with aged out entries (the min sequence number ends up being 
less than the min data sequence number during filtering). 
   
   
   This also ties in to trying to make this logic a bit less complicated since 
that last condition I mentioned is what's leading to the current implementation.



##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -185,6 +198,16 @@ List<ManifestFile> filterManifests(Schema tableSchema, 
List<ManifestFile> manife
       return ImmutableList.of();
     }
 
+    // Use the current set of referenced manifests as a source of truth when 
it's a subset of all
+    // manifests and all removals which were performed reference manifests.
+    // If a manifest is not in the trusted referenced set, filtering can be 
skipped
+    // assuming that the manifest does not have any live entries or aged out 
deletes
+    Set<String> manifestLocations =
+        manifests.stream().map(ManifestFile::path).collect(Collectors.toSet());
+    boolean trustReferencedManifests =
+        allDeletesReferenceManifests
+            && manifestLocations.containsAll(manifestsReferencedForDeletes);

Review Comment:
   Let me know if you think the new approach is easier to read or you prefer 
going back to the more instance state based approach @rdblue . I think keeping 
it all in a single boolean which includes the fact that all the deletes 
referenced manifests makes it a little bit better.



##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -185,6 +198,16 @@ List<ManifestFile> filterManifests(Schema tableSchema, 
List<ManifestFile> manife
       return ImmutableList.of();
     }
 
+    // Use the current set of referenced manifests as a source of truth when 
it's a subset of all
+    // manifests and all removals which were performed reference manifests.
+    // If a manifest is not in the trusted referenced set, filtering can be 
skipped
+    // assuming that the manifest does not have any live entries or aged out 
deletes
+    Set<String> manifestLocations =
+        manifests.stream().map(ManifestFile::path).collect(Collectors.toSet());
+    boolean trustReferencedManifests =
+        allDeletesReferenceManifests
+            && manifestLocations.containsAll(manifestsReferencedForDeletes);

Review Comment:
   Instead of making `trustReferencedManifests` an instance variable, I made it 
an argument that gets plumbed through. I tried upleveling more state to be 
passed through but the challenge is it ends up being more convoluted since 
there are cases (e.g. aged out deletes or new manifests with live entries) 
which still need filtering even though the trusted set does not reference those 
manifests. Upleveling more requires upleveling even more of the filter logic 
for these other cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to