amogh-jahagirdar commented on code in PR #10123:
URL: https://github.com/apache/iceberg/pull/10123#discussion_r1566285705


##########
core/src/main/java/org/apache/iceberg/BaseRowDelta.java:
##########
@@ -43,6 +43,10 @@ protected BaseRowDelta self() {
 
   @Override
   protected String operation() {
+    if (!addsDataFiles() && addsDeleteFiles()) {

Review Comment:
   @aokolnychyi I  agree that we should use `replace` when it's the same 
logical set of data but I think the issue is on what `RewriteFiles` actually 
enforces currently. It doesn't do any validation on not only having deletes, 
and I wouldn't expect a logical comparison since that requires a data 
comparison which isn't feasible.
   
   I wrote the following test in `TestRewriteFiles`:
   
   ```
     @TestTemplate
     public void testDelete() {
       commit(table, table.newAppend().appendFile(FILE_A).appendFile(FILE_B), 
branch);
   
       table.newRewrite().deleteFile(FILE_A).toBranch(branch).commit();
   
       table.currentSnapshot();
     }
   
   ```
   and the operation commits successfully with a REPLACE operation. I know it's 
strange to use this API (and goes against what's in the JavaDoc) for just 
deleting a file but as of today it's technically possible and would lead to 
producing a snapshot operation with the incorrect REPLACE operation which can 
lead to issues such as the incremental processing case.
   
   There's 2 ways to address this incorrectness that I can think of:
   
   1.) Fail at the time of committing if there's only deletes or only 
additions. If this is the case then we know that the data cannot possibly be 
logically the same since there'd essentially be a net gain or net loss in data. 
I'm not sure if there's any other cases but this seems like a reasonable check 
which doesn't require actually comparing the data.
   2.) Don't fail (to preserve the API behavior) and rather at the time of 
committing, if it's only a delete or only an append, use the correct API to 
perform the operation with the right summary.
   
   I'm leaning towards 1, since the API docs already makes it clear that from 
an API perspective `RewriteFiles` should only be used when it's logically the 
same. If we know for a fact that it's not with a simple check, we can enforce 
that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to