hantangwangd commented on code in PR #10983:
URL: https://github.com/apache/iceberg/pull/10983#discussion_r1733265734
##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -327,4 +342,34 @@ private Set<String> findFilesToDelete(
return filesToDelete;
}
+
+ private boolean isSafeToDelete(
+ ManifestEntry<?> entry,
+ Map<Long, Long> validSnapshotIdToSequenceNumberMap,
+ Map<Long, Long> expiredSnapshotIdToSequenceNumberMap) {
+ if (validSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())
+ ||
!expiredSnapshotIdToSequenceNumberMap.containsKey(entry.snapshotId())) {
+ return false;
+ }
+
+ // The file in DELETE entry can be deleted if there are no Snapshots older
than
+ // this one
+ if (validSnapshotIdToSequenceNumberMap.keySet().stream()
+ .noneMatch(snapshotId -> snapshotId < entry.snapshotId())) {
+ return true;
Review Comment:
> Snapshot IDs are guaranteed to be unique but not necessarily monotonically
increasing (in fact by default in the reference implementation it's random.
Thanks for your reminder, sorry for misleading by
`TestTableOperations.newSnapshotId()`. Comparing the `snapshotId` is definitely
wrong and I will fix it.
> I think you'll just need to check that there's no parent snapshot for each
of the snapshots in the valid snapshots map.
Checking the parent snapshot may also encounter some problems, for example,
we may have already expired some specified snapshots including the parent of
this snapshot before. In this case, the inheritance chain between snapshots has
been broken. So it seems that we shouldn't rely on it as it may lead in wrong
result.
For example, if we have a table with actions as follows:
```
table.newAppend().appendFile(FILE_A).commit; // snapshotID1
table.newAppend().appendFile(FILE_B).commit; // snapshotID2
table.newDelete().deleteFile(FILE_A).commit; // snapshotID3
table.newAppend().appendFile(FILE_A2).commit; // snapshotID4
```
And we have already expired specified snapshot `snapshotID2` before.
Now when we trying to expire snapshot `snapshotID3` using the policy of
`IncrementalFileCleanup`, `FILE_A` would be wrongly cleaned up if we rely on
checking the parent snapshot.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]