amogh-jahagirdar commented on code in PR #10983:
URL: https://github.com/apache/iceberg/pull/10983#discussion_r1731788916


##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -61,17 +63,21 @@ public void cleanFiles(TableMetadata beforeExpiration, 
TableMetadata afterExpira
     // 4. Delete the manifest lists
 
     Set<Long> validIds = Sets.newHashSet();
+    Map<Long, Long> validSnapshotIdToSequenceNumberMap = 
Maps.newConcurrentMap();
     for (Snapshot snapshot : afterExpiration.snapshots()) {
       validIds.add(snapshot.snapshotId());
+      validSnapshotIdToSequenceNumberMap.put(snapshot.snapshotId(), 
snapshot.sequenceNumber());
     }
 
     Set<Long> expiredIds = Sets.newHashSet();
+    Map<Long, Long> expiredSnapshotIdToSequenceNumberMap = 
Maps.newConcurrentMap();

Review Comment:
   Same as above, `expiredSnapshotToSequenceNumber`



##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -327,4 +342,34 @@ private Set<String> findFilesToDelete(
 
     return filesToDelete;
   }
+
+  private boolean isSafeToDelete(
+      ManifestEntry<?> entry,
+      Map<Long, Long> validSnapshotIdToSequenceNumberMap,
+      Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) {
+    if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())
+        || 
!expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) {
+      return false;
+    }
+
+    // The file in DELETE entry can be deleted if there are no Snapshots older 
than
+    // this one
+    if (validSnapshotIdToSequenceNumberMap.keySet().stream()
+        .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) {
+      return true;

Review Comment:
   I think you'll just need to check that there's no parent snapshot for each 
of the snapshots in the valid snapshots map



##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -327,4 +342,34 @@ private Set<String> findFilesToDelete(
 
     return filesToDelete;
   }
+
+  private boolean isSafeToDelete(
+      ManifestEntry<?> entry,
+      Map<Long, Long> validSnapshotIdToSequenceNumberMap,
+      Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) {
+    if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())
+        || 
!expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) {
+      return false;
+    }
+
+    // The file in DELETE entry can be deleted if there are no Snapshots older 
than
+    // this one
+    if (validSnapshotIdToSequenceNumberMap.keySet().stream()
+        .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) {
+      return true;
+    }
+
+    // The file in DELETE entry can be deleted if no snapshot exists between
+    // the one that added it and this one
+    long fileSequenceNumber = entry.fileSequenceNumber() == null ? 0L : 
entry.fileSequenceNumber();
+    long fileDeletedSnapshotSequenceNumber =
+        expiredSnapshotIdToSequenceNumberMap.get(entry.snapshotId());
+    return fileSequenceNumber > 0

Review Comment:
   Actually we may not need to rely on the sequence numbers:
   
   ```
   // The file in DELETE entry can be deleted if no snapshot exists between
   // the one that added it and this one
   ```
   
   I agree with this statement but doesn't this just mean we need to check the 
parent snapshots manifest entries for this particular file to see if it was 
added specifically in the parent snapshot? Although this is another manifest 
read I feel like it's OK (it's not as intensive as the reachable cleanup 
case/still feels incremental enough) and I think it should also solve the V1 
case.



##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -61,17 +63,21 @@ public void cleanFiles(TableMetadata beforeExpiration, 
TableMetadata afterExpira
     // 4. Delete the manifest lists
 
     Set<Long> validIds = Sets.newHashSet();
+    Map<Long, Long> validSnapshotIdToSequenceNumberMap = 
Maps.newConcurrentMap();

Review Comment:
   `validSnapshotToSequenceNumber` or `retainedSnapshotToSequenceNumber`? The 
map suffix is redundant. 
   
   Also with this new approach, we may also need to look at refactoring to get 
rid of the other `validIds` set since now we can just use this new map



##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -327,4 +342,34 @@ private Set<String> findFilesToDelete(
 
     return filesToDelete;
   }
+
+  private boolean isSafeToDelete(
+      ManifestEntry<?> entry,
+      Map<Long, Long> validSnapshotIdToSequenceNumberMap,
+      Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) {
+    if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())
+        || 
!expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) {
+      return false;
+    }
+
+    // The file in DELETE entry can be deleted if there are no Snapshots older 
than
+    // this one
+    if (validSnapshotIdToSequenceNumberMap.keySet().stream()
+        .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) {
+      return true;
+    }
+
+    // The file in DELETE entry can be deleted if no snapshot exists between
+    // the one that added it and this one
+    long fileSequenceNumber = entry.fileSequenceNumber() == null ? 0L : 
entry.fileSequenceNumber();
+    long fileDeletedSnapshotSequenceNumber =
+        expiredSnapshotIdToSequenceNumberMap.get(entry.snapshotId());
+    return fileSequenceNumber > 0

Review Comment:
   Since we're using sequence numbers this means that we're also saying for v1 
tables we can't determine to clean it up in the incremental strategy? I think 
that's OK since orphan files can clean it up later. Worth thinking through 
other ways to determine this beyond leveraging sequence numbers.



##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -327,4 +342,34 @@ private Set<String> findFilesToDelete(
 
     return filesToDelete;
   }
+
+  private boolean isSafeToDelete(
+      ManifestEntry<?> entry,
+      Map<Long, Long> validSnapshotIdToSequenceNumberMap,
+      Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) {
+    if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())
+        || 
!expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) {
+      return false;
+    }
+
+    // The file in DELETE entry can be deleted if there are no Snapshots older 
than
+    // this one
+    if (validSnapshotIdToSequenceNumberMap.keySet().stream()
+        .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) {
+      return true;

Review Comment:
   This is not right since we cannot use the snapshot ID as an ordering. 
Snapshot IDs are guaranteed to be unique but not necessarily monotonically 
increasing (in fact by default in the reference implementation it's random 
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotIdGeneratorUtil.java#L36)
 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to