hantangwangd commented on code in PR #10983: URL: https://github.com/apache/iceberg/pull/10983#discussion_r1733882501
########## core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java: ########## @@ -327,4 +342,34 @@ private Set<String> findFilesToDelete( return filesToDelete; } + + private boolean isSafeToDelete( + ManifestEntry<?> entry, + Map<Long, Long> validSnapshotIdToSequenceNumberMap, + Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) { + if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId()) + || !expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) { + return false; + } + + // The file in DELETE entry can be deleted if there are no Snapshots older than + // this one + if (validSnapshotIdToSequenceNumberMap.keySet().stream() + .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) { + return true; Review Comment: > Or potentially since we have the invariant here that there's only one ref at this point, we can just validate that there are no ancestors at all. Thanks for providing this information. Based on the invariant you mentioned, we can make some useful decisions rely on the inheritance chain. Firstly, if we find that the parent snapshot is still valid, we can determine directly that the file in DELETED entry cannot be safely deleted. It must exist in the parent snapshot. Secondly, although we cannot determine whether all ancestors no longer exist by judging whether there all still valid ancestors, but conversely, we can confirm a situation that all the ancestors are expired in the same operation through checking the expired snapshots collection. This covers a narrower range, but the situation is quite common (due to the `expire older than` operation). The advantage of rules relying on inheritance chain is that they can support both v1 and v2 tables. And if the rules relying on inheritance chain do not work, we can take advantage of file sequence number to make some decisions for only v2 tables like what we do now, or maybe could figure out some other rules to do more intelligent decisions. I have refactored the code to include the rules mentioned above, please take a look when available. Any informations or ideas will be greatly appreciated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org