Re: [PR] Support changelog scan for table with delete files [iceberg]

via GitHub Wed, 16 Oct 2024 15:21:10 -0700


aokolnychyi commented on PR #10935:
URL: https://github.com/apache/iceberg/pull/10935#issuecomment-2418068303


   I went through some of my old notes, which we should discuss.
   
   We have the following tasks right now:
   - `AddedRowsScanTask` (added data file + deletes that happened within the 
same snapshot).
   - `DeletedDataFileScanTask` (removed data file + deletes that applied to it 
before it was removed).
   - `DeletedRowsScanTask` (data file affected by a delete file, either a 
position delete or a resolved equality delete, + historic deletes that were 
there before + new deletes added in this snapshot).
   
   First of all, we need to discuss the expected behavior:
   - Do we want to resolve equality deletes and map them into data files? Or 
should we add a new task and output the content of equality delete files? I'd 
say we should support both options (without resolving by default?).
   - What if a snapshot adds a new delete file for the already deleted record? 
That can happen both with position and equality deletes. For instance, if two 
position-based delete operations remove the same set of records, both of them 
will succeed. Producing a precise CDC log would require reading all historic 
deletes, which may be unnecessarily expensive in some cases. I'd say this should 
be configurable as well.
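   
   To make the double-delete case concrete, here is a minimal Java sketch. It 
models delete files as plain sets of row positions within one data file. 
`DoubleDeleteSketch`, `newlyDeletedPositions`, and the `resolveDoubleDeletes` 
flag are hypothetical names for illustration, not Iceberg APIs.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not Iceberg API. Positions are row ordinals in one data file.
final class DoubleDeleteSketch {

  // Returns the positions reported as deleted by the newly added delete file.
  // When resolveDoubleDeletes is false, historic deletes are not consulted,
  // so a position that was already deleted is reported again.
  static Set<Long> newlyDeletedPositions(
      Set<Long> historicDeletes, Set<Long> addedDeletes, boolean resolveDoubleDeletes) {
    Set<Long> result = new TreeSet<>(addedDeletes);
    if (resolveDoubleDeletes) {
      result.removeAll(historicDeletes);
    }
    return result;
  }

  public static void main(String[] args) {
    Set<Long> historic = Set.of(3L, 7L); // removed by an earlier snapshot
    Set<Long> added = Set.of(7L, 9L);    // removed by the snapshot being scanned
    // Cheap mode: position 7 appears even though it was already deleted.
    System.out.println(newlyDeletedPositions(historic, added, false));
    // Precise mode: requires reading the historic deletes.
    System.out.println(newlyDeletedPositions(historic, added, true));
  }
}
```

   The precise mode is the one that needs all historic deletes, which is why 
making it optional seems reasonable.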
   
   If we want to make resolving equality deletes optional, we need 
`AddedEqualityDeletesScanTask`. We discussed this for Flink CDC use cases. It 
is going to be costly at planning and query time to apply equality deletes to 
data files to get removed records. As long as the equality delete persists the 
entire row or the caller is OK with only equality columns, it should be fine to 
output the content of equality delete files as is.
   
   I agree with the idea of iterating snapshot by snapshot. I wonder if we can 
optimize it, though.
   
   **AddedRowsScanTask**
   
   Because the file was just added, we don’t have to look up historic deletes; 
only position deletes added in the same snapshot can apply. For each added data 
file, it is enough to look up matching position deletes in a `DeleteFileIndex` 
built from the delete manifests added in this snapshot.
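   
   A rough sketch of that lookup, assuming simplified stand-in types. 
`AddedRowsSketch` and its records are hypothetical, and the partition-keyed map 
only imitates what `DeleteFileIndex` would do; none of this is the real Iceberg 
API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for data files, position delete files, and the index.
final class AddedRowsSketch {
  record DataFile(String path, String partition) {}

  // referencedDataFile is null when the delete file may cover any file in the partition.
  record PosDeleteFile(String path, String partition, String referencedDataFile) {}

  record AddedRowsTask(DataFile file, List<PosDeleteFile> deletes) {}

  // Builds one task per added data file, using only deletes added in the same snapshot.
  static List<AddedRowsTask> plan(List<DataFile> addedFiles, List<PosDeleteFile> addedDeletes) {
    // Index the same-snapshot position deletes by partition.
    Map<String, List<PosDeleteFile>> index = new HashMap<>();
    for (PosDeleteFile delete : addedDeletes) {
      index.computeIfAbsent(delete.partition(), p -> new ArrayList<>()).add(delete);
    }
    List<AddedRowsTask> tasks = new ArrayList<>();
    for (DataFile file : addedFiles) {
      List<PosDeleteFile> matching = new ArrayList<>();
      for (PosDeleteFile delete : index.getOrDefault(file.partition(), List.of())) {
        if (delete.referencedDataFile() == null
            || delete.referencedDataFile().equals(file.path())) {
          matching.add(delete);
        }
      }
      tasks.add(new AddedRowsTask(file, matching));
    }
    return tasks;
  }
}
```

   No historic delete manifests are read here, which is the point of the 
optimization.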
   
   **DeletedDataFileScanTask**
   
   We have to build a `DeleteFileIndex` that includes all historic deletes for 
removed data files. We can optimize this step by reading delete manifests only 
for the affected partitions. We can create a `PartitionMap` predicate from the 
removed data files and use it while reading delete manifests. We can supplement 
that with a predicate on `referencedDataFile` in the future. Historic delete 
files that do not affect the partitions of removed data files can be discarded.
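   
   The pruning step could look roughly like this. It is a sketch under 
simplifying assumptions: partitions are plain strings, every delete file is 
scoped to a single partition, and a `Set` stands in for the `PartitionMap` 
predicate. `DeleteManifestPruningSketch` and its records are hypothetical names.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-ins, not Iceberg API.
final class DeleteManifestPruningSketch {
  record DataFile(String path, String partition) {}

  record DeleteFile(String path, String partition) {}

  // Keeps only the historic delete files whose partition holds a removed data file.
  static List<DeleteFile> pruneHistoricDeletes(
      List<DataFile> removedFiles, List<DeleteFile> historicDeletes) {
    // Partition predicate derived from the removed data files.
    Set<String> affectedPartitions =
        removedFiles.stream().map(DataFile::partition).collect(Collectors.toSet());
    return historicDeletes.stream()
        .filter(delete -> affectedPartitions.contains(delete.partition()))
        .collect(Collectors.toList());
  }
}
```

   Only the surviving delete files would then go into the index for the removed 
data files.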
   
   **DeletedRowsScanTask**
   
   For each added position delete file and each added equality delete file (if 
equality deletes must be resolved), find matching data files and historic 
deletes (if configured to resolve double deletes). We can still build a 
partition predicate from all delete files that were added in this snapshot and 
use it to prune the set of data manifests for that snapshot. For each newly 
added delete file, output a task for each affected data file (one delete file 
can match multiple data files).
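   
   As a sketch of the fan-out, assuming partition equality is the only matching 
rule (real matching would also consider sequence numbers, column bounds, and 
`referencedDataFile`). `DeletedRowsSketch`, its records, and the 
`resolveDoubleDeletes` flag are hypothetical, not Iceberg API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-ins, not Iceberg API.
final class DeletedRowsSketch {
  record DataFile(String path, String partition) {}

  record DeleteFile(String path, String partition) {}

  record DeletedRowsTask(DataFile file, DeleteFile addedDelete, List<DeleteFile> historicDeletes) {}

  static List<DeletedRowsTask> plan(
      List<DeleteFile> addedDeletes,
      List<DataFile> existingFiles,
      List<DeleteFile> historicDeletes,
      boolean resolveDoubleDeletes) {
    // Partition predicate built from the delete files added in this snapshot.
    Set<String> affected =
        addedDeletes.stream().map(DeleteFile::partition).collect(Collectors.toSet());
    // Stands in for pruning data manifests with that predicate.
    List<DataFile> candidates =
        existingFiles.stream()
            .filter(file -> affected.contains(file.partition()))
            .collect(Collectors.toList());

    List<DeletedRowsTask> tasks = new ArrayList<>();
    for (DeleteFile added : addedDeletes) {
      for (DataFile file : candidates) {
        if (!file.partition().equals(added.partition())) {
          continue;
        }
        // Historic deletes are attached only when double deletes must be resolved.
        List<DeleteFile> historic = new ArrayList<>();
        if (resolveDoubleDeletes) {
          for (DeleteFile old : historicDeletes) {
            if (old.partition().equals(file.partition())) {
              historic.add(old);
            }
          }
        }
        tasks.add(new DeletedRowsTask(file, added, historic));
      }
    }
    return tasks;
  }
}
```

   One added delete file that matches two data files yields two tasks, each 
carrying the same added delete file.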
   
   **AddedEqualityDeletesScanTask**
   
   Output each added equality delete file as a separate task, without matching 
it to data files (if configured).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

