pvary commented on code in PR #14264:
URL: https://github.com/apache/iceberg/pull/14264#discussion_r2419065337


##########
core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java:
##########
@@ -71,6 +80,12 @@ protected CloseableIterable<ChangelogScanTask> doPlanFiles(
             .filter(manifest -> changelogSnapshotIds.contains(manifest.snapshotId()))
             .toSet();
 
+    // Build delete file index for existing deletes (before the start snapshot)
+    DeleteFileIndex existingDeleteIndex = buildExistingDeleteIndex(fromSnapshotIdExclusive);

Review Comment:
   @talatuyarer: Don't forget about overlapping equality deletes too.
   
   Flink could generate the following:
   - Snapshot 1 - Insert a record (data file (D1) added with PK1)
   - Snapshot 2 - Update a record (equality delete file (EQ1) added with PK1, and data file (D2) added with PK1)
   - Snapshot 3 - Update a record again (equality delete file (EQ2) added with PK1, and data file (D3) added with PK1)
   
   When we are emitting rows for Snapshot 3:
   - We need to scan D1 but should not emit a delete record for PK1, as it was already deleted in Snapshot 2
   - We need to scan D2 and emit a delete record for PK1, as it was not deleted in Snapshot 2
   - We need to scan D3 and emit the data record for PK1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

