pvary commented on issue #12280: URL: https://github.com/apache/iceberg/issues/12280#issuecomment-2662219496
Be careful about rewriting equality deletes into new equality deletes. An equality delete removes every occurrence of the matching row from all previous commits.

For example:
- Commit 1 adds rows with PK1 and PK2
  - Creates a data file with PK1 and PK2
- Commit 2 deletes PK1
  - Creates an equality delete for PK1
- Commit 3 inserts PK1
  - Creates a data file for PK1
- Commit 4 updates PK2
  - Creates an equality delete for PK2, and a data file for PK2
- Commit 5 updates PK2
  - Creates an equality delete for PK2, and a data file for PK2
- Commit 6 does the equality delete compaction

If we compact the equality deletes, we need to decide at which commit the compacted deletes should be applied. If we apply them at Commit 6, we lose PK1. If we apply them at Commit 2, we end up with a duplicated PK2.

Converting equality deletes to positional deletes with file granularity (Spark-like) or to deletion vectors (DVs, Impala-like) could help reduce the number of files that different readers need to read.
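
To make the ordering problem concrete, here is a minimal Python sketch of the six commits above. It assumes the rule from the Iceberg table spec that an equality delete applies only to data files with a strictly smaller data sequence number, and it models one sequence number per commit. `DataFile`, `EqDelete`, `live_rows`, `compacted` and `to_positional` are hypothetical names used for illustration only; none of them are Iceberg APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    seq: int      # sequence number of the commit that added the file (assumed: one per commit)
    keys: tuple   # primary keys of the rows stored in the file, in row order


@dataclass(frozen=True)
class EqDelete:
    seq: int          # sequence number the delete is applied at
    keys: frozenset   # primary keys removed by this delete


def is_deleted(f, key, eq_deletes):
    # Assumed rule: a delete only affects files with a strictly smaller sequence number.
    return any(d.seq > f.seq and key in d.keys for d in eq_deletes)


def live_rows(data_files, eq_deletes):
    """Return (key, file sequence number) for every row that survives the deletes."""
    return sorted(
        (key, f.seq)
        for f in data_files
        for key in f.keys
        if not is_deleted(f, key, eq_deletes)
    )


def to_positional(data_files, eq_deletes):
    """Resolve equality deletes into (file index, row position) pairs."""
    return {
        (i, pos)
        for i, f in enumerate(data_files)
        for pos, key in enumerate(f.keys)
        if is_deleted(f, key, eq_deletes)
    }


# Commits 1 to 5 from the example.
data_files = [
    DataFile(1, ("PK1", "PK2")),
    DataFile(3, ("PK1",)),
    DataFile(4, ("PK2",)),
    DataFile(5, ("PK2",)),
]
eq_deletes = [
    EqDelete(2, frozenset({"PK1"})),
    EqDelete(4, frozenset({"PK2"})),
    EqDelete(5, frozenset({"PK2"})),
]


def compacted(seq):
    """Commit 6: merge all equality deletes into one, applied at sequence number `seq`."""
    all_keys = frozenset().union(*(d.keys for d in eq_deletes))
    return [EqDelete(seq, all_keys)]


print(live_rows(data_files, eq_deletes))      # [('PK1', 3), ('PK2', 5)]
print(live_rows(data_files, compacted(6)))    # []
print(live_rows(data_files, compacted(2)))    # [('PK1', 3), ('PK2', 4), ('PK2', 5)]
print(sorted(to_positional(data_files, eq_deletes)))  # [(0, 0), (0, 1), (2, 0)]
```

In this simplified model, applying the compacted delete at Commit 6 removes the re-inserted PK1 (and the current PK2 along with it), while applying it at Commit 2 leaves both the Commit 4 and Commit 5 versions of PK2 visible. The positional pairs name specific rows in specific files, so in this model they can be merged without having to choose a sequence number.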