talatuyarer opened a new pull request, #14264:
URL: https://github.com/apache/iceberg/pull/14264

   This PR extends `BaseIncrementalChangelogScan` to support positional and 
equality delete files (MoR use cases), enabling complete CDC coverage.
   
   I read all of the discussion on @wypoon's #10935. That PR is old and does 
not cover all cases, so I decided to implement this with a fresh approach but a 
reduced scope. The original PR included a Spark implementation; this PR focuses 
only on core module support. With this implementation, changelog scans now 
correctly produce `DeletedRowsScanTask`, `DeletedDataFileScanTask`, and 
`AddedRowsScanTask` even when delete files are present.
   
   _I implemented unit tests for the following scenarios:_
   Scenario | Expected Behavior
   -- | -- 
   Insert → Equality Delete → Re-insert | One DELETE + one INSERT changelog event
   Insert & Delete in Same Commit | Emitted as one task with attached deletes
   Overlapping Equality + Position Deletes | Deduplicated delete application
   Existing Deletes + New Deletes | Both applied without duplication
   Overwrite Snapshot with Deletes | No NPE, correct ADDED/DELETED tasks
   Large Delete Set | Pruned by partition filter
   
   Step-by-step implementation design:
   1. Delete Index Construction
   Two types of delete indexes are now built:
   - `ExistingDeleteIndex`: built from delete manifests before the scan start 
snapshot. It ensures previous deletes are applied but not re-emitted as 
changelog events, which helps preserve row lineage for CDC.
   - `AddedDeletesBySnapshot`: per-snapshot indexes of the delete files newly 
added in each changelog snapshot.
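
The split between the two indexes can be illustrated with a minimal sketch. It is not the PR's code: `DeleteIndexSketch`, `DeleteRef`, and `firstChangelogSeq` are hypothetical names, and the sketch assumes every delete file is tagged with the sequence number of the snapshot that added it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these types are Iceberg classes.
class DeleteIndexSketch {
  // Stand-in for a delete file: its path and the sequence number of the snapshot that added it.
  static final class DeleteRef {
    final String path;
    final long addedAtSeq;

    DeleteRef(String path, long addedAtSeq) {
      this.path = path;
      this.addedAtSeq = addedAtSeq;
    }
  }

  // Deletes committed before the changelog range: applied, never emitted as events.
  final List<String> existing = new ArrayList<>();
  // Deletes committed inside the range, grouped by the snapshot that added them.
  final Map<Long, List<String>> addedBySnapshot = new TreeMap<>();

  static DeleteIndexSketch build(List<DeleteRef> deletes, long firstChangelogSeq) {
    DeleteIndexSketch index = new DeleteIndexSketch();
    for (DeleteRef delete : deletes) {
      if (delete.addedAtSeq < firstChangelogSeq) {
        index.existing.add(delete.path);
      } else {
        index.addedBySnapshot
            .computeIfAbsent(delete.addedAtSeq, seq -> new ArrayList<>())
            .add(delete.path);
      }
    }
    return index;
  }

  public static void main(String[] args) {
    DeleteIndexSketch index = build(
        List.of(new DeleteRef("eq-1", 1L), new DeleteRef("pos-2", 3L), new DeleteRef("eq-3", 4L)),
        3L);
    System.out.println("existing = " + index.existing);
    System.out.println("addedBySnapshot = " + index.addedBySnapshot);
  }
}
```

Keeping the per-snapshot map sorted by sequence number lets a planner walk the changelog snapshots in commit order.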
   
   2. Task Planning - Data File Changes
   Delete files are accumulated incrementally across the changelog snapshots. 
`CreateDataFileChangeTasks` now attaches the applicable deletes to each data 
file task:
   - `AddedRowsScanTask` → combines existing + newly added deletes
   - `DeletedDataFileScanTask` → includes existing deletes (before deletion)
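
A rough sketch of the attachment rule described above, under simplifying assumptions: file names stand in for real data and delete files, every delete is treated as applicable to every data file (no partition or sequence-number filtering), and `AttachDeletesSketch`, `SnapshotChange`, and the task labels are hypothetical names rather than anything from the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; file names stand in for data and delete files.
class AttachDeletesSketch {
  // What one changelog snapshot changed.
  static final class SnapshotChange {
    final List<String> addedData;
    final List<String> removedData;
    final List<String> addedDeletes;

    SnapshotChange(List<String> addedData, List<String> removedData, List<String> addedDeletes) {
      this.addedData = addedData;
      this.removedData = removedData;
      this.addedDeletes = addedDeletes;
    }
  }

  // Returns task label -> delete files attached to that task, in planning order.
  static Map<String, List<String>> plan(List<String> existingDeletes, List<SnapshotChange> changes) {
    Map<String, List<String>> tasks = new LinkedHashMap<>();
    // Running list of deletes; extended once per snapshot instead of rebuilt.
    List<String> accumulated = new ArrayList<>(existingDeletes);
    for (SnapshotChange change : changes) {
      // A removed data file only sees deletes that existed before its removal.
      for (String file : change.removedData) {
        tasks.put("DELETED_FILE:" + file, new ArrayList<>(accumulated));
      }
      accumulated.addAll(change.addedDeletes);
      // An added data file sees the earlier deletes plus those added in the same snapshot.
      for (String file : change.addedData) {
        tasks.put("ADDED:" + file, new ArrayList<>(accumulated));
      }
    }
    return tasks;
  }

  public static void main(String[] args) {
    SnapshotChange overwrite =
        new SnapshotChange(List.of("f2.parquet"), List.of("f1.parquet"), List.of("d1"));
    System.out.println(plan(List.of("d0"), List.of(overwrite)));
  }
}
```

Each task receives its own copy of the accumulated list, so later snapshots cannot change the deletes already attached to an earlier task.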
   
   3. Task Planning for Existing Data Files with New Deletes
   `planDeletedRowsTasks()` identifies EXISTING data files affected by newly 
added delete files and emits the corresponding `DeletedRowsScanTask` objects.
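
The matching step might look roughly like the sketch below. It assumes, purely for illustration, that a delete file affects a data file when both share the same partition value; the real planner would also need to consider sequence numbers and delete type. `DeletedRowsPlanSketch`, `FileRef`, and `planDeletedRows` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; partition equality is a simplified stand-in for real delete matching.
class DeletedRowsPlanSketch {
  static final class FileRef {
    final String path;
    final String partition;

    FileRef(String path, String partition) {
      this.path = path;
      this.partition = partition;
    }
  }

  // For each pre-existing data file, collect the newly added deletes in the same partition.
  // Data files with no matching delete produce no task.
  static Map<String, List<String>> planDeletedRows(
      List<FileRef> existingData, List<FileRef> newDeletes) {
    Map<String, List<String>> tasks = new LinkedHashMap<>();
    for (FileRef data : existingData) {
      List<String> matching = new ArrayList<>();
      for (FileRef delete : newDeletes) {
        if (delete.partition.equals(data.partition)) {
          matching.add(delete.path);
        }
      }
      if (!matching.isEmpty()) {
        tasks.put(data.path, matching);
      }
    }
    return tasks;
  }

  public static void main(String[] args) {
    List<FileRef> data = List.of(new FileRef("a.parquet", "day=1"), new FileRef("b.parquet", "day=2"));
    List<FileRef> deletes = List.of(new FileRef("eq-del-1", "day=1"));
    System.out.println(planDeletedRows(data, deletes));
  }
}
```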
   
   Processing multiple snapshots and equality deletes is expensive for huge 
tables, so I added a few performance optimizations.
   
   Optimization | Description
   -- | --
   Manifest caching | Introduced `DeleteManifestCache` to avoid redundant manifest parsing
   Partition pruning | Uses filter pushdown to skip irrelevant delete manifests
   Incremental accumulation | Avoids rebuilding the delete index for every snapshot
   Selective copy | `ContentFileUtil.copy()` keeps only essential stats to reduce memory footprint
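
As an illustration of the manifest caching idea only, here is a minimal memoizing cache keyed by manifest path. It is not the PR's `DeleteManifestCache`: `ManifestCacheSketch` and its `reader` function are hypothetical, and the "manifest contents" are plain strings.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch of a parse-once cache; not the PR's DeleteManifestCache.
class ManifestCacheSketch {
  private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
  // Stand-in for the expensive step of opening and parsing a delete manifest.
  private final Function<String, List<String>> reader;

  ManifestCacheSketch(Function<String, List<String>> reader) {
    this.reader = reader;
  }

  // Parses a manifest on first request and serves later requests from memory.
  List<String> deleteFiles(String manifestPath) {
    return cache.computeIfAbsent(manifestPath, reader);
  }

  public static void main(String[] args) {
    AtomicInteger reads = new AtomicInteger();
    ManifestCacheSketch manifests = new ManifestCacheSketch(path -> {
      reads.incrementAndGet();
      return List.of(path + "#delete-0");
    });
    manifests.deleteFiles("m1.avro");
    manifests.deleteFiles("m1.avro"); // served from the cache
    manifests.deleteFiles("m2.avro");
    System.out.println("manifest reads = " + reads.get());
  }
}
```

Because consecutive changelog snapshots usually share most of their delete manifests, a cache of this shape means each manifest is parsed once per scan rather than once per snapshot.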
   
   Possible future work:
   - V3 deletion vector support. This PR tries to lay the foundation for 
deletion-vector-based CDC.
   - Add a persistent delete index cache
   - Engine-level support
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

