pvary commented on issue #12280:
URL: https://github.com/apache/iceberg/issues/12280#issuecomment-2662219496

   Be careful about rewriting equality deletes into new equality deletes. An 
equality delete removes every matching row that was written in earlier 
commits.
   For example:
   - Commit 1 adds rows for PK1 and PK2 - Creates a data file with PK1 and PK2
   - Commit 2 deletes PK1 - Creates an equality delete for PK1
   - Commit 3 inserts PK1 - Creates a data file for PK1
   - Commit 4 updates PK2 - Creates an equality delete for PK2, and a data file 
for PK2
   - Commit 5 updates PK2 - Creates an equality delete for PK2, and a data file 
for PK2
   - Commit 6 does the equality delete compaction 
   
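   If we compact the equality deletes, then we need to decide at which commit 
the compacted deletes should be applied. If we apply them at Commit 6, we lose 
PK1, because the delete also removes the row re-inserted in Commit 3. If we 
apply them at Commit 2, PK2 is duplicated, because the rows written in 
Commits 4 and 5 both survive.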
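
   The scenario above can be replayed with a small simulation. This is only a 
sketch under simplifying assumptions, not Iceberg code: each commit number is 
used as a sequence number, an equality delete is assumed to apply only to rows 
with a strictly lower sequence number, and the names `DataRow`, `EqDelete`, 
`live_rows` and `compact` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRow:
    key: str
    seq: int  # commit (sequence number) that wrote the row


@dataclass(frozen=True)
class EqDelete:
    key: str
    seq: int  # commit (sequence number) at which the delete applies


def live_rows(rows, deletes):
    """Keep a row unless an equality delete on its key has a strictly higher seq."""
    return [
        r for r in rows
        if not any(d.key == r.key and d.seq > r.seq for d in deletes)
    ]


def compact(deletes, seq):
    """Hypothetical compaction: one delete per key, all applied at a single seq."""
    return [EqDelete(key, seq) for key in sorted({d.key for d in deletes})]


# Commits 1-5 from the example above.
rows = [
    DataRow("PK1", 1), DataRow("PK2", 1),  # Commit 1
    DataRow("PK1", 3),                     # Commit 3
    DataRow("PK2", 4),                     # Commit 4
    DataRow("PK2", 5),                     # Commit 5
]
deletes = [EqDelete("PK1", 2), EqDelete("PK2", 4), EqDelete("PK2", 5)]

expected = live_rows(rows, deletes)          # table state before compaction
at_6 = live_rows(rows, compact(deletes, 6))  # compacted deletes applied at Commit 6
at_2 = live_rows(rows, compact(deletes, 2))  # compacted deletes applied at Commit 2

print("expected:", expected)
print("applied at 6:", at_6)
print("applied at 2:", at_2)
```

   In this model the uncompacted table keeps one PK1 row (Commit 3) and one 
PK2 row (Commit 5). Neither choice of commit for the compacted deletes 
reproduces that state: one removes the re-inserted PK1 row, the other leaves 
two PK2 rows.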
   
   Converting equality deletes to positional deletes with file granularity 
(Spark-like), or to deletion vectors (DVs, Impala-like), could help reduce 
the number of files that different readers have to read.
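
   As a rough illustration of why the conversion avoids the ordering problem, 
the sketch below resolves each equality delete into explicit (file, position) 
pairs, grouped per data file. It is not the Iceberg API: the file names, the 
in-memory layout and the `to_positional` helper are assumptions made for the 
example, and commit numbers again stand in for sequence numbers.

```python
from collections import defaultdict

# Hypothetical data files: path -> (sequence number, keys in row order).
data_files = {
    "data-1.parquet": (1, ["PK1", "PK2"]),
    "data-3.parquet": (3, ["PK1"]),
    "data-4.parquet": (4, ["PK2"]),
    "data-5.parquet": (5, ["PK2"]),
}

# Equality deletes as (key, sequence number) pairs.
eq_deletes = [("PK1", 2), ("PK2", 4), ("PK2", 5)]


def to_positional(data_files, eq_deletes):
    """Resolve equality deletes into sorted row positions per data file."""
    positions = defaultdict(set)
    for path, (file_seq, keys) in data_files.items():
        for pos, key in enumerate(keys):
            # A delete only hits rows written with a strictly lower seq.
            if any(k == key and s > file_seq for k, s in eq_deletes):
                positions[path].add(pos)
    return {path: sorted(pos_set) for path, pos_set in positions.items()}


positional = to_positional(data_files, eq_deletes)
print(positional)
```

   Because each resulting entry names an exact file and row position, it no 
longer matters at which commit the converted deletes are applied, and a reader 
only needs the entries for the data files it actually scans.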


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

