aokolnychyi commented on PR #10935:
URL: https://github.com/apache/iceberg/pull/10935#issuecomment-2430250182

   Coming back to some questions above.
   
   > Do we want to resolve equality deletes and map them into data files? Or 
should we add a new task and output the content of equality delete files?
   
   While it would be useful to stream out the ingested CDC log from Flink 
without applying equality deletes, we probably shouldn't target that for now. 
If we start outputting equality deletes directly, our changelog may not be 
accurate. What if we upsert a record that didn't exist? What if an equality 
delete applies to 10 data records? We can find reasonable behavior for such 
cases, but let's skip this for now. Let's always resolve and assign equality 
deletes so that the changelog is precise.
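   
   To illustrate the two edge cases, here is a minimal sketch in plain Python. 
It is not Iceberg code: `Row` and `resolve_equality_delete` are hypothetical 
names, and rows are matched on a single `id` column for simplicity. It shows 
that one equality delete can resolve to zero or to many DELETE entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical record shape; `id` plays the role of the equality field.
    id: int
    value: str


def resolve_equality_delete(live_rows, key):
    """Return one DELETE changelog entry per live row that matches the key."""
    return [("DELETE", row) for row in live_rows if row.id == key]


# Case 1: an upsert for a key that never existed. The equality delete
# matches nothing, so the changelog should contain no DELETE for it.
no_match = resolve_equality_delete([Row(1, "a")], key=99)

# Case 2: a single equality delete that applies to 10 data records.
# The changelog should contain 10 DELETE entries, not one.
many_rows = [Row(7, f"v{i}") for i in range(10)]
many_match = resolve_equality_delete(many_rows, key=7)
```

   Emitting the equality delete as-is would report exactly one delete in both 
cases, which is why resolving against data files gives a more precise result.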
   
   > What if a snapshot adds a new delete file for the already deleted record? 
That can happen both with position and equality deletes.
   
   There is one more data point here. We are about to introduce sync 
maintenance for position deletes. This means new delete files will include 
both old and new deleted positions. Therefore, we must always resolve historic 
deletes.
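   
   As a rough sketch of what resolving historic deletes means for position 
deletes (plain Python, not the Iceberg API; `newly_deleted_positions` and the 
sample positions are made up for illustration), the changelog for a snapshot 
should only contain positions that were not already deleted before it:

```python
def newly_deleted_positions(historic, current):
    """Positions deleted by this snapshot, excluding ones deleted earlier."""
    return sorted(set(current) - set(historic))


# Positions already deleted in one data file before the snapshot.
historic = {3, 8}

# A rewritten delete file that carries both old and new positions.
rewritten = {3, 8, 15}

# Only position 15 is a new change for this snapshot.
new_positions = newly_deleted_positions(historic, rewritten)

# A delete file that repeats only already deleted positions adds no changes.
repeated = newly_deleted_positions(historic, {3, 8})
```

   Without the subtraction, positions 3 and 8 would show up as deletes a 
second time.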
   
   To sum up, I propose sticking to the existing changelog tasks and always 
resolving historic deletes to produce a correct changelog. That said, I am 
concerned about full table scans for each incremental snapshot. I'll look into 
the ideas mentioned above to see if we can optimize that.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

