rdblue commented on code in PR #13614:
URL: https://github.com/apache/iceberg/pull/13614#discussion_r2220580612


##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -50,7 +50,7 @@ class IncrementalFileCleanup extends FileCleanupStrategy {
   @Override
   @SuppressWarnings({"checkstyle:CyclomaticComplexity", "MethodLength"})
   public void cleanFiles(TableMetadata beforeExpiration, TableMetadata 
afterExpiration) {
-    if (afterExpiration.refs().size() > 1) {
+    if (beforeExpiration.refs().size() > 1 || afterExpiration.refs().size() > 
1) {

Review Comment:
   Could this leak files if there is a rollback or a staged (but not 
referenced) commit?
   
   For instance, if you have snapshots like this:
   ```
   A --- B --- C --- D (main)
          `--- X (unreferenced)
   ```
   
   If snapshot X should be cleaned up, Iceberg should not use the incremental 
strategy. Also, if X is not removed but B and C are, then incremental also 
can't be used because C may have removed files from X.
   
   Instead of checking the number of refs, I think we need to check the history 
of the main branch. If there is any snapshot outside of the main branch before 
or after expiration, incremental cannot be used. There are cases where a ref is 
in the main branch's history that are safe, but I would probably keep the 
implementation simpler and avoid checking for things like branch overlap.
   
   What I would do is:
   1. If there is more than one ref after expiration, use the reference set 
strategy (optionally, we could check that all refs are in the main branch's 
history and allow it)
   2. If there are any snapshots to be removed after expiration that were not 
the main branch's ancestors, use the reference set strategy



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to