pvary commented on PR #14092: URL: https://github.com/apache/iceberg/pull/14092#issuecomment-3328574516
> IMHO we can only aggregate WriteResults without delete files (append-only), unless we change Iceberg core to enforce an order on how data / delete files are applied. I'd be curious to hear @pvary's thoughts on this.

We can only commit the files of multiple checkpoints in a single commit when the checkpoints contain only appends (data files). For Iceberg, everything that is committed in a single transaction happened at the same time. So if both checkpoints contain equality deletes, they will be applied together.

Consider this scenario:
- R1 insert - new data file (DF1) with R1 (PK1)
- C1 commit
- R1' update - new equality delete file with PK1 (EQ1), new data file (DF2) with R1'
- C2 commit
- R1'' update - new equality delete file with PK1 (EQ2), new data file (DF3) with R1''
- C3 commit

If we merge the C2 and C3 commits, then EQ1 and EQ2 are added in the same commit, and they will be applied only to DF1. They will not be applied to DF2 or DF3, because those files are added in that same commit, and as a result we will have a duplicate in our table: both R1' and R1'' will be present after C3.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
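
The scenario in the comment can be reproduced with a small toy model. This is a minimal sketch and not Iceberg code: the `Table`, `DataFile`, and `EqDeleteFile` names are hypothetical, and the only rule it models is the one from the Iceberg table spec that an equality delete applies to a data file only when the data file's sequence number is strictly lower than the delete file's. All files added in one commit are assumed to share one sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:  # hypothetical stand-in for an Iceberg data file
    name: str
    rows: tuple  # (primary_key, value) pairs
    seq: int     # sequence number assigned at commit time


@dataclass(frozen=True)
class EqDeleteFile:  # hypothetical stand-in for an equality delete file
    name: str
    keys: frozenset  # primary keys deleted by this file
    seq: int


class Table:
    """Toy table: every file added in one commit gets the same sequence number."""

    def __init__(self):
        self.seq = 0
        self.data_files = []
        self.delete_files = []

    def commit(self, data=(), deletes=()):
        self.seq += 1
        for name, rows in data:
            self.data_files.append(DataFile(name, tuple(rows), self.seq))
        for name, keys in deletes:
            self.delete_files.append(EqDeleteFile(name, frozenset(keys), self.seq))

    def scan(self):
        live = []
        for df in self.data_files:
            for pk, value in df.rows:
                # An equality delete only hits data files with a strictly lower seq.
                deleted = any(d.seq > df.seq and pk in d.keys for d in self.delete_files)
                if not deleted:
                    live.append(value)
        return live


def separate_commits():
    t = Table()
    t.commit(data=[("DF1", [("PK1", "R1")])])                                  # C1
    t.commit(data=[("DF2", [("PK1", "R1'")])], deletes=[("EQ1", {"PK1"})])     # C2
    t.commit(data=[("DF3", [("PK1", "R1''")])], deletes=[("EQ2", {"PK1"})])    # C3
    return t.scan()


def merged_commit():
    t = Table()
    t.commit(data=[("DF1", [("PK1", "R1")])])                                  # C1
    # C2 and C3 merged: DF2, DF3, EQ1 and EQ2 all share one sequence number.
    t.commit(
        data=[("DF2", [("PK1", "R1'")]), ("DF3", [("PK1", "R1''")])],
        deletes=[("EQ1", {"PK1"}), ("EQ2", {"PK1"})],
    )
    return t.scan()


if __name__ == "__main__":
    print("separate:", separate_commits())
    print("merged:  ", merged_commit())
```

With separate commits, EQ2 has a higher sequence number than DF2, so only R1'' survives. In the merged case neither delete file has a higher sequence number than DF2 or DF3, so both R1' and R1'' remain, which is the duplication the comment describes.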
