amogh-jahagirdar commented on code in PR #10983: URL: https://github.com/apache/iceberg/pull/10983#discussion_r1731788916
########## core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java: ########## @@ -61,17 +63,21 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira // 4. Delete the manifest lists Set<Long> validIds = Sets.newHashSet(); + Map<Long, Long> validSnapshotIdToSequenceNumberMap = Maps.newConcurrentMap(); for (Snapshot snapshot : afterExpiration.snapshots()) { validIds.add(snapshot.snapshotId()); + validSnapshotIdToSequenceNumberMap.put(snapshot.snapshotId(), snapshot.sequenceNumber()); } Set<Long> expiredIds = Sets.newHashSet(); + Map<Long, Long> expiredSnapshotIdToSequenceNumberMap = Maps.newConcurrentMap(); Review Comment: Same as above, `expiredSnapshotToSequenceNumber` ########## core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java: ########## @@ -327,4 +342,34 @@ private Set<String> findFilesToDelete( return filesToDelete; } + + private boolean isSafeToDelete( + ManifestEntry<?> entry, + Map<Long, Long> validSnapshotIdToSequenceNumberMap, + Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) { + if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId()) + || !expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) { + return false; + } + + // The file in DELETE entry can be deleted if there are no Snapshots older than + // this one + if (validSnapshotIdToSequenceNumberMap.keySet().stream() + .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) { + return true; Review Comment: I think you'll just need to check that there's no parent snapshot for each of the snapshots in the valid snapshots map ########## core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java: ########## @@ -327,4 +342,34 @@ private Set<String> findFilesToDelete( return filesToDelete; } + + private boolean isSafeToDelete( + ManifestEntry<?> entry, + Map<Long, Long> validSnapshotIdToSequenceNumberMap, + Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) { + if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId()) + || !expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) { + return false; + } + + // The file in DELETE entry can be deleted if there are no Snapshots older than + // this one + if (validSnapshotIdToSequenceNumberMap.keySet().stream() + .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) { + return true; + } + + // The file in DELETE entry can be deleted if no snapshot exists between + // the one that added it and this one + long fileSequenceNumber = entry.fileSequenceNumber() == null ? 0L : entry.fileSequenceNumber(); + long fileDeletedSnapshotSequenceNumber = + expiredSnapshotIdToSequenceNumberMap.get(entry.snapshotId()); + return fileSequenceNumber > 0 Review Comment: Actually we may not need to rely on the sequence numbers: ``` // The file in DELETE entry can be deleted if no snapshot exists between // the one that added it and this one ``` I agree with this statement but doesn't this just mean we need to check the parent snapshots manifest entries for this particular file to see if it was added specifically in the parent snapshot? Although this is another manifest read I feel like it's OK (it's not as intensive as the reachable cleanup case/still feels incremental enough) and I think it should also solve the V1 case. ########## core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java: ########## @@ -61,17 +63,21 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira // 4. Delete the manifest lists Set<Long> validIds = Sets.newHashSet(); + Map<Long, Long> validSnapshotIdToSequenceNumberMap = Maps.newConcurrentMap(); Review Comment: `validSnapshotToSequenceNumber` or `retainedSnapshotToSequenceNumber`? The map suffix is redundant. Also with this new approach, we may also need to look at refactoring to get rid of the other `validIds` set since now we can just use this new map ########## core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java: ########## @@ -327,4 +342,34 @@ private Set<String> findFilesToDelete( return filesToDelete; } + + private boolean isSafeToDelete( + ManifestEntry<?> entry, + Map<Long, Long> validSnapshotIdToSequenceNumberMap, + Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) { + if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId()) + || !expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) { + return false; + } + + // The file in DELETE entry can be deleted if there are no Snapshots older than + // this one + if (validSnapshotIdToSequenceNumberMap.keySet().stream() + .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) { + return true; + } + + // The file in DELETE entry can be deleted if no snapshot exists between + // the one that added it and this one + long fileSequenceNumber = entry.fileSequenceNumber() == null ? 0L : entry.fileSequenceNumber(); + long fileDeletedSnapshotSequenceNumber = + expiredSnapshotIdToSequenceNumberMap.get(entry.snapshotId()); + return fileSequenceNumber > 0 Review Comment: Since we're using sequence numbers this means that we're also saying for v1 tables we can't determine to clean it up in the incremental strategy? I think that's OK since orphan files can clean it up later. Worth thinking through other ways to determine this beyond leveraging sequence numbers. ########## core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java: ########## @@ -327,4 +342,34 @@ private Set<String> findFilesToDelete( return filesToDelete; } + + private boolean isSafeToDelete( + ManifestEntry<?> entry, + Map<Long, Long> validSnapshotIdToSequenceNumberMap, + Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) { + if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId()) + || !expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) { + return false; + } + + // The file in DELETE entry can be deleted if there are no Snapshots older than + // this one + if (validSnapshotIdToSequenceNumberMap.keySet().stream() + .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) { + return true; Review Comment: This is not right since we cannot use the snapshot ID as an ordering. Snapshot IDs are guaranteed to be unique but not necessarily monotonically increasing (in fact by default in the reference implementation it's random https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotIdGeneratorUtil.java#L36) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org