pvary commented on PR #10935:
URL: https://github.com/apache/iceberg/pull/10935#issuecomment-2420515789

> First of all, we need to discuss the expected behavior:
>
> * Do we want to resolve equality deletes and map them into data files? Or should we add a new task and output the content of equality delete files? I'd say we should support both options (without resolving by default?).

If the table is written by Flink, then for a Flink CDC streaming read the lazy solution (not resolving the equality deletes) would be enough. If another writer creates non-Flink-conformant equality delete files for the table, or the user wants a retracting CDC stream from a table that was written by an upsert CDC stream, then equality delete resolution is still needed.

> * What if a snapshot adds a new delete file for the already deleted record? That can happen both with position and equality deletes. For instance, if two position-based delete operations remove the same set of records, both of them will succeed. Producing a precise CDC log would require reading all historic deletes, which may be unnecessarily expensive in some cases. I'd say this should be configurable as well.

I think that applying delete files is less costly, so I would stick to the theoretically correct solution and apply previously added delete files to produce the precise CDC log.
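
To make the second point concrete, here is a toy sketch of the difference between the lazy and the precise approach for repeated position deletes. It is not Iceberg or Flink code: the snapshot model (a list of `(snapshot_id, set of (file, position))` pairs) and the function names `lazy_delete_events` and `precise_delete_events` are hypothetical and only illustrate the idea of applying previously added deletes before emitting CDC records.

```python
from typing import List, Set, Tuple

# Toy stand-in for a position delete entry: (data file path, row position).
Position = Tuple[str, int]
# Toy stand-in for a snapshot: (snapshot id, position deletes added by it).
Snapshot = Tuple[int, Set[Position]]
Event = Tuple[int, Position]


def lazy_delete_events(snapshots: List[Snapshot]) -> List[Event]:
    """Emit one DELETE per delete entry, without looking at earlier snapshots."""
    return [(sid, pos) for sid, deletes in snapshots for pos in sorted(deletes)]


def precise_delete_events(snapshots: List[Snapshot]) -> List[Event]:
    """Emit a DELETE only the first time a row position is deleted."""
    already_deleted: Set[Position] = set()
    events: List[Event] = []
    for sid, deletes in snapshots:
        # Apply historic deletes: rows removed earlier are no longer live,
        # so deleting them again must not produce another CDC record.
        newly_deleted = deletes - already_deleted
        events.extend((sid, pos) for pos in sorted(newly_deleted))
        already_deleted |= newly_deleted
    return events


# Two delete operations that overlap on row 3 of the same data file.
snapshots: List[Snapshot] = [
    (1, {("data-a.parquet", 0), ("data-a.parquet", 3)}),
    (2, {("data-a.parquet", 3), ("data-a.parquet", 5)}),
]

lazy = lazy_delete_events(snapshots)
precise = precise_delete_events(snapshots)
```

With this input the lazy variant reports row 3 as deleted twice, while the precise variant reports it once, at the cost of keeping the set of earlier deletes around.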