amogh-jahagirdar commented on code in PR #10123: URL: https://github.com/apache/iceberg/pull/10123#discussion_r1566285705
########## core/src/main/java/org/apache/iceberg/BaseRowDelta.java: ########## @@ -43,6 +43,10 @@ protected BaseRowDelta self() { @Override protected String operation() { + if (!addsDataFiles() && addsDeleteFiles()) { Review Comment: @aokolnychyi I agree that we should use `replace` when it's the same logical set of data but I think the issue is on what `RewriteFiles` actually enforces currently. It doesn't do any validation on not only having deletes, and I wouldn't expect a logical comparison since that requires a data comparison which isn't feasible. I wrote the following test in `TestRewriteFiles`: ``` @TestTemplate public void testDelete() { commit(table, table.newAppend().appendFile(FILE_A).appendFile(FILE_B), branch); table.newRewrite().deleteFile(FILE_A).toBranch(branch).commit(); table.currentSnapshot(); } ``` and the operation commits successfully with a REPLACE operation. I know it's strange to use this API (and goes against what's in the JavaDoc) for just deleting a file but as of today it's technically possible and would lead to producing a snapshot operation with the incorrect REPLACE operation which can lead to issues such as the incremental processing case. There's 2 ways to address this incorrectness that I can think of: 1.) Fail at the time of committing if there's only deletes or only additions. If this is the case then we know that the data cannot possibly be logically the same since there'd essentially be a net gain or net loss in data. I'm not sure if there's any other cases but this seems like a reasonable check which doesn't require actually comparing the data. 2.) Don't fail (to preserve the API behavior) and rather at the time of committing, if it's only a delete or only an append, use the correct API to perform the operation with the right summary. I'm leaning towards 1, since the API docs already makes it clear that from an API perspective `RewriteFiles` should only be used when it's logically the same. If we know for a fact that it's not with a simple check, we can enforce that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org