aokolnychyi commented on PR #10935: URL: https://github.com/apache/iceberg/pull/10935#issuecomment-2430250182
Coming back to some questions above.

> Do we want to resolve equality deletes and map them into data files? Or should we add a new task and output the content of equality delete files?

While it would be useful to stream out the ingested CDC log from Flink without applying equality deletes, we probably shouldn't target that for now. If we start outputting equality deletes directly, our changelog may not be accurate. What if we upsert a record that didn't exist? What if an equality delete applies to 10 data records? We could find reasonable behavior for these cases, but let's skip this for now. Let's always resolve and assign equality deletes so that the changelog is precise.

> What if a snapshot adds a new delete file for the already deleted record? That can happen both with position and equality deletes.

There is one more data point here. We are about to introduce sync maintenance for position deletes. This means new delete files will include old + new deleted positions. Therefore, we must always resolve historic deletes.

To sum up, I propose sticking to the existing changelog tasks and always resolving historic deletes to produce a correct changelog. I am concerned about full table scans for each incremental snapshot. I'll look into the ideas mentioned above to see if we can optimize that.
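
To make the two concerns concrete, here is a toy Python sketch of what "resolving" deletes could look like. It is not Iceberg's API: `Row`, `resolve_deletes`, and all parameter names are hypothetical. It also ignores sequence numbers and partitioning, and assumes equality deletes match on a single key column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    file: str   # data file the row lives in
    pos: int    # row position within that file
    key: int    # column used by equality deletes (assumed single-column)
    value: str


def resolve_deletes(data_rows, eq_delete_keys, new_pos_deletes, historic_pos_deletes):
    """Return the rows a snapshot actually removes (toy model, not Iceberg code)."""
    removed = []
    for row in data_rows:
        loc = (row.file, row.pos)
        if loc in historic_pos_deletes:
            # Already deleted by an earlier snapshot, so it is not a new change.
            continue
        if row.key in eq_delete_keys or loc in new_pos_deletes:
            removed.append(row)
    return removed


data_rows = [
    Row("a.parquet", 0, 1, "x"),
    Row("a.parquet", 1, 1, "y"),
    Row("a.parquet", 2, 2, "z"),
]
# Position 0 was deleted in an earlier snapshot.
historic_pos_deletes = {("a.parquet", 0)}
# Key 1 matches two rows; key 99 never existed (an upsert of a new record).
eq_delete_keys = {1, 99}
# A rewritten delete file carrying old + new positions.
new_pos_deletes = {("a.parquet", 0), ("a.parquet", 2)}

removed = resolve_deletes(data_rows, eq_delete_keys, new_pos_deletes, historic_pos_deletes)
for row in removed:
    print("DELETE", row)
```

In this sketch the changelog gets one DELETE per row that is actually removed. The equality delete on the missing key 99 produces nothing, and position 0 is not reported again even though the new delete file repeats it. Emitting the raw delete files instead would give one entry for key 1 (which hides a row that was already gone), one for key 99, and a repeated entry for position 0.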