pvary commented on PR #10935:
URL: https://github.com/apache/iceberg/pull/10935#issuecomment-2420515789

   > First of all, we need to discuss the expected behavior:
   > 
   > * Do we want to resolve equality deletes and map them into data files? Or 
should we add a new task and output the content of equality delete files? I'd 
say we should support both options (without resolving by default?).
   
   If the table is written by Flink, then the lazy solution (not resolving the 
equality deletes) is enough for a Flink CDC streaming read. If another writer 
creates equality delete files that do not follow the conventions of the Flink 
writer, or if the user wants a retracting CDC stream from a table that was 
written by an upsert CDC stream, then equality delete resolution is still needed.
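
The difference between the two options can be sketched in plain Python. This is only an illustration, not the Iceberg or Flink API: rows are modelled as dicts, and `lazy_changelog`, `resolved_changelog` and the `DELETE_BY_KEY` operation name are hypothetical. The lazy variant forwards the equality delete as-is, so only the key columns are known. The resolved variant matches each delete against the existing rows so it can emit the full deleted row, which is what a retracting stream would need.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Change = Tuple[str, Row]  # (operation, payload)


def lazy_changelog(eq_deletes: List[Row], inserts: List[Row]) -> List[Change]:
    """Emit equality deletes unresolved: only the key columns are available."""
    changes: List[Change] = [("DELETE_BY_KEY", dict(d)) for d in eq_deletes]
    changes += [("INSERT", dict(r)) for r in inserts]
    return changes


def resolved_changelog(
    existing: List[Row], eq_deletes: List[Row], inserts: List[Row]
) -> List[Change]:
    """Match each equality delete against existing rows and emit the full row."""
    changes: List[Change] = []
    for d in eq_deletes:
        for row in existing:
            # A row is deleted when it agrees with the delete on every key column.
            if all(row.get(k) == v for k, v in d.items()):
                changes.append(("DELETE", dict(row)))
    changes += [("INSERT", dict(r)) for r in inserts]
    return changes


# Hypothetical table content and one upsert of id=1.
existing = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
eq_deletes = [{"id": 1}]
inserts = [{"id": 1, "name": "a2"}]

lazy = lazy_changelog(eq_deletes, inserts)
resolved = resolved_changelog(existing, eq_deletes, inserts)
```

In this sketch the lazy output carries only `{"id": 1}` for the delete, while the resolved output carries the complete old row, at the price of reading the existing data.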
   
   > * What if a snapshot adds a new delete file for the already deleted 
record? That can happen both with position and equality deletes. For instance, 
if two position-based delete operations remove the same set of records, both of 
them will succeed. Producing a precise CDC log would require reading all 
historic deletes, which may be unnecessarily expensive in some cases. I'd say 
this should be configurable as well.
   
   I think that applying delete files is less costly, so I would stick to the 
theoretically correct solution and apply the previously added delete files to 
produce the precise CDC log.
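
As a rough illustration of applying previously added deletes, the following Python sketch models position deletes as sets of `(data file, row position)` pairs. It is not Iceberg code, and `precise_deletes` and the file names are hypothetical. A delete only shows up in the changelog for a snapshot if no earlier delete file already removed the same row.

```python
from typing import List, Set, Tuple

Position = Tuple[str, int]  # (data file path, row position)


def precise_deletes(
    historic: List[Set[Position]], new_snapshot: Set[Position]
) -> Set[Position]:
    """Return only the rows that the new snapshot deletes for the first time."""
    already_deleted: Set[Position] = set()
    for deletes in historic:
        already_deleted |= deletes
    # Rows removed by an earlier delete file must not be reported again.
    return new_snapshot - already_deleted


# Two delete operations that both remove position 3 of the same data file.
snapshot_1 = {("data-1.parquet", 0), ("data-1.parquet", 3)}
snapshot_2 = {("data-1.parquet", 3), ("data-1.parquet", 5)}

first = precise_deletes([], snapshot_1)
second = precise_deletes([snapshot_1], snapshot_2)
```

Here `second` contains only position 5, because position 3 was already deleted by the first snapshot.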
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

