talatuyarer opened a new pull request, #14264: URL: https://github.com/apache/iceberg/pull/14264
This PR extends `BaseIncrementalChangelogScan` to support positional and equality delete files (MoR use cases), enabling complete CDC coverage.

I read through all of the discussion on @wypoon's #10935. That PR is old and does not cover all cases, so I decided to take a fresh approach with a reduced scope: the initial PR included a Spark implementation, while this PR focuses only on core module support. With this implementation, changelog scans now correctly produce `DeletedRowsScanTask`, `DeletedDataFileScanTask`, and `AddedRowsScanTask` even when delete files are present.

_I implemented unit tests for the following scenarios:_

| Scenario | Expected Behavior |
| -- | -- |
| Insert → Equality Delete → Re-insert | One DELETE + one INSERT changelog event |
| Insert & Delete in Same Commit | Emitted as one task with attached deletes |
| Overlapping Equality + Position Deletes | Deduplicated delete application |
| Existing Deletes + New Deletes | Both applied without duplication |
| Overwrite Snapshot with Deletes | No NPE, correct ADDED/DELETED tasks |
| Large Delete Set | Pruned by partition filter |

### Step-by-Step Implementation Design

**1. Delete Index Construction**

Two types of delete indexes are now built (a minimal model of this split is sketched at the end of this description):

- `ExistingDeleteIndex` – built from the delete manifests committed before the scan's start snapshot. It ensures previous deletes are applied but not re-emitted as changelog events, which helps keep row lineage intact for CDC.
- `AddedDeletesBySnapshot` – per-snapshot indexes of the delete files newly added in each changelog snapshot.

**2. Task Planning - Data File Changes**

Each changelog snapshot incrementally accumulates delete files. `CreateDataFileChangeTasks` now attaches the applicable deletes to each data file task:

- `AddedRowsScanTask` → combines existing + newly added deletes
- `DeletedDataFileScanTask` → includes existing deletes (those in effect before the file was deleted)

**3. Task Planning for Existing Data Files with New Deletes**

`planDeletedRowsTasks()` identifies EXISTING data files affected by newly added delete files and emits corresponding `DeletedRowsScanTask` objects. (A consumer-side sketch of how the three task types are interpreted is included at the end of this description.)

Processing multiple snapshots and applying equality deletes are expensive operations for huge tables, so I added a few performance optimizations:

| Optimization | Description |
| -- | -- |
| Manifest caching | Introduced `DeleteManifestCache` to avoid redundant manifest parsing |
| Partition pruning | Uses filter pushdown to skip irrelevant delete manifests |
| Incremental accumulation | Avoids rebuilding the delete index for every snapshot |
| Selective copy | `ContentFileUtil.copy()` keeps only essential stats to reduce the memory footprint |

### Possible Future Work

- V3 deletion vector support. This PR tries to lay the foundation for deletion-vector-based CDC.
- Add a persistent delete index cache
- Engine-level support
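### Illustrative Sketches

The sketches below are illustrative only; they are not code from this PR. First, a simplified model of the two-index split described in step 1. The class and field names here are hypothetical and only mirror the idea of keeping "existing" deletes separate from per-snapshot "added" deletes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: not the actual index classes introduced by this PR.
class DeleteIndexModel {
  // Delete files committed before the scan's start snapshot: applied when reading
  // data files so already-deleted rows are not re-emitted, but they never produce
  // DELETE changelog events themselves.
  final List<String> existingDeletePaths = new ArrayList<>();

  // Delete files added by each changelog snapshot, keyed by snapshot ID: these are
  // the files that turn EXISTING data files into DeletedRowsScanTask events.
  final Map<Long, List<String>> addedDeletesBySnapshot = new HashMap<>();

  // Deletes applicable when reading a data file at a given snapshot: all existing
  // deletes plus everything added up to and including that snapshot (incremental
  // accumulation, so the index is not rebuilt from scratch per snapshot).
  List<String> applicableDeletes(long snapshotId, List<Long> orderedSnapshotIds) {
    List<String> result = new ArrayList<>(existingDeletePaths);
    for (long id : orderedSnapshotIds) {
      result.addAll(addedDeletesBySnapshot.getOrDefault(id, List.of()));
      if (id == snapshotId) {
        break;
      }
    }
    return result;
  }
}
```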
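Next, a minimal consumer-side sketch (again, not part of this PR) showing how the three changelog task types are intended to be interpreted once deletes are attached. It uses only public Iceberg API types and assumes `table` is an already-loaded `org.apache.iceberg.Table` and that the two snapshot IDs exist in its history.

```java
import org.apache.iceberg.AddedRowsScanTask;
import org.apache.iceberg.ChangelogScanTask;
import org.apache.iceberg.DeletedDataFileScanTask;
import org.apache.iceberg.DeletedRowsScanTask;
import org.apache.iceberg.IncrementalChangelogScan;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class ChangelogTaskInspector {

  // Plans a changelog scan over (fromSnapshotId, toSnapshotId] and prints how each
  // task type relates to delete files. Error handling is omitted for brevity.
  public static void inspect(Table table, long fromSnapshotId, long toSnapshotId)
      throws java.io.IOException {
    IncrementalChangelogScan scan =
        table
            .newIncrementalChangelogScan()
            .fromSnapshotExclusive(fromSnapshotId)
            .toSnapshot(toSnapshotId);

    try (CloseableIterable<ChangelogScanTask> tasks = scan.planFiles()) {
      for (ChangelogScanTask task : tasks) {
        if (task instanceof AddedRowsScanTask) {
          // New data file: its rows are INSERT events, minus anything removed by the
          // attached delete files (existing + newly added in the same commit).
          AddedRowsScanTask added = (AddedRowsScanTask) task;
          System.out.printf(
              "INSERT  file=%s attachedDeletes=%d%n",
              added.file().location(), added.deletes().size());
        } else if (task instanceof DeletedRowsScanTask) {
          // Existing data file hit by new delete files: rows matched by addedDeletes()
          // but not already removed by existingDeletes() become DELETE events.
          DeletedRowsScanTask deletedRows = (DeletedRowsScanTask) task;
          System.out.printf(
              "DELETE  file=%s addedDeletes=%d existingDeletes=%d%n",
              deletedRows.file().location(),
              deletedRows.addedDeletes().size(),
              deletedRows.existingDeletes().size());
        } else if (task instanceof DeletedDataFileScanTask) {
          // Whole data file removed: rows still live at removal time (i.e. not covered
          // by existingDeletes()) become DELETE events.
          DeletedDataFileScanTask deletedFile = (DeletedDataFileScanTask) task;
          System.out.printf(
              "DELETE  file=%s existingDeletes=%d%n",
              deletedFile.file().location(), deletedFile.existingDeletes().size());
        }
      }
    }
  }
}
```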
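Finally, a toy sketch of the manifest-caching idea from the optimizations table. This is not the `DeleteManifestCache` added by the PR; it only illustrates memoizing parsed delete-manifest contents by manifest path so the same manifest is not re-read for every snapshot in the changelog range.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical memoizer keyed by manifest path. The real cache in the PR may differ
// (e.g. scoped to a single scan, bounded in size, or keyed differently).
class DeleteManifestMemoizer<E> {
  private final ConcurrentMap<String, List<E>> cache = new ConcurrentHashMap<>();

  // `loader` performs the expensive manifest read/parse; it runs at most once per
  // distinct manifest path, and later callers reuse the cached entries.
  List<E> entries(String manifestPath, Function<String, List<E>> loader) {
    return cache.computeIfAbsent(manifestPath, loader);
  }
}
```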
