amogh-jahagirdar commented on code in PR #10123:
URL: https://github.com/apache/iceberg/pull/10123#discussion_r1566285705
##########
core/src/main/java/org/apache/iceberg/BaseRowDelta.java:
##########
@@ -43,6 +43,10 @@ protected BaseRowDelta self() {
@Override
protected String operation() {
+ if (!addsDataFiles() && addsDeleteFiles()) {
Review Comment:
@aokolnychyi I agree that we should use `replace` when it's the same
logical set of data but I think the issue is on what `RewriteFiles` actually
enforces currently. It doesn't do any validation on not only having deletes,
and I wouldn't expect a logical comparison since that requires a data
comparison which isn't feasible.
I wrote the following test in `TestRewriteFiles`:
```
@TestTemplate
public void testDelete() {
commit(table, table.newAppend().appendFile(FILE_A).appendFile(FILE_B),
branch);
table.newRewrite().deleteFile(FILE_A).toBranch(branch).commit();
table.currentSnapshot();
}
```
and the operation commits successfully with a REPLACE operation. I know it's
strange to use this API (and goes against what's in the JavaDoc) for just
deleting a file but as of today it's technically possible and would lead to
producing a snapshot operation with the incorrect REPLACE operation which can
lead to issues such as the incremental processing case.
There's 2 ways to address this incorrectness that I can think of:
1.) Fail at the time of committing if there's only deletes or only
additions. If this is the case then we know that the data cannot possibly be
logically the same since there'd essentially be a net gain or net loss in data.
I'm not sure if there's any other cases but this seems like a reasonable check
which doesn't require actually comparing the data.
2.) Don't fail (to preserve the API behavior) and rather at the time of
committing, if it's only a delete or only an append, use the correct API to
perform the operation with the right summary.
I'm leaning towards 1, since the API docs already makes it clear that from
an API perspective `RewriteFiles` should only be used when it's logically the
same. If we know for a fact that it's not locally the same table with a simple
check, we can enforce that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]