aokolnychyi commented on PR #10935: URL: https://github.com/apache/iceberg/pull/10935#issuecomment-2418068303
I went through some of my old notes, which we should discuss.

We have the following tasks right now:
- `AddedRowsScanTask` (added data file + deletes that happened within the same snapshot).
- `DeletedDataFileScanTask` (removed data file + deletes that applied to it before it was removed).
- `DeletedRowsScanTask` (data file that was affected by a delete file (if we resolve equality deletes or we had position deletes) + historic deletes that were there before + new deletes added in this snapshot).

First of all, we need to discuss the expected behavior:
- Do we want to resolve equality deletes and map them into data files? Or should we add a new task and output the content of equality delete files? I'd say we should support both options (without resolving by default?).
- What if a snapshot adds a new delete file for an already deleted record? That can happen with both position and equality deletes. For instance, if two position-based delete operations remove the same set of records, both of them will succeed. Producing a precise CDC log would require reading all historic deletes, which may be unnecessarily expensive in some cases. I'd say this should be configurable as well.

If we want to make resolving equality deletes optional, we need `AddedEqualityDeletesScanTask`. We discussed this for Flink CDC use cases. It is going to be costly at planning and query time to apply equality deletes to data files to get removed records. As long as the equality delete persists the entire row or the caller is OK with only equality columns, it should be fine to output the content of equality delete files as is.

I agree with the idea of iterating snapshot by snapshot. I wonder if we can optimize it, though.

**AddedRowsScanTask**

We don’t have to look up historic deletes as the file was just added, except the position deletes added in the same snapshot.
For each added data file, it is enough to look up matching position deletes in `DeleteFileIndex` built from delete manifests added in this snapshot.

**DeletedDataFileScanTask**

We have to build `DeleteFileIndex` that includes all historic deletes for removed data files. We can optimize this step by reading delete manifests only for the affected partitions. We can create a `PartitionMap` predicate from new data files and use it while reading delete manifests. We can supplement that with a predicate on `referencedDataFile` in the future. Historic delete files that are not affecting deleted data file partitions can be discarded.

**DeletedRowsScanTask**

For each position delete file and each equality delete file (if those must be resolved), find matching data files and historic deletes (if configured to resolve double deletes). We can still build a partition predicate from all delete files that were added in this snapshot. We can use that predicate to prune the set of data manifests for that snapshot. For each newly added delete file, output a task for each affected data file (one delete file can match multiple data files).

**AddedEqualityDeletesScanTask**

Output each added equality delete file as a separate task without matching them to data files (if configured).
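
To make the four cases concrete, here is a rough per-snapshot planning sketch. It is not Iceberg code: `DataFile`, `DeleteFile`, `Task`, `plan`, `demo`, and the two flags `resolveEqualityDeletes` / `resolveDoubleDeletes` are hypothetical names made up for illustration, and "a delete file matches a data file" is simplified to "same partition string" (real matching would also have to consider sequence numbers, file paths, and column bounds).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model for illustration only; none of these are Iceberg classes.
final class ChangelogPlanningSketch {
  enum DeleteKind { POSITION, EQUALITY }

  static final class DataFile {
    final String path, partition;
    DataFile(String path, String partition) { this.path = path; this.partition = partition; }
  }

  static final class DeleteFile {
    final String path, partition;
    final DeleteKind kind;
    DeleteFile(String path, String partition, DeleteKind kind) {
      this.path = path; this.partition = partition; this.kind = kind;
    }
  }

  static final class Task {
    final String type, file;
    final List<String> deletes;
    Task(String type, String file, List<String> deletes) {
      this.type = type; this.file = file; this.deletes = deletes;
    }
  }

  // Assumption: a delete file applies to a data file iff their partitions are equal.
  static List<String> matching(List<DeleteFile> deletes, String partition, boolean positionOnly) {
    List<String> paths = new ArrayList<>();
    for (DeleteFile d : deletes) {
      if (d.partition.equals(partition) && (!positionOnly || d.kind == DeleteKind.POSITION)) {
        paths.add(d.path);
      }
    }
    return paths;
  }

  static List<Task> plan(
      List<DataFile> added, List<DataFile> removed, List<DataFile> existing,
      List<DeleteFile> addedDeletes, List<DeleteFile> historicDeletes,
      boolean resolveEqualityDeletes, boolean resolveDoubleDeletes) {
    List<Task> tasks = new ArrayList<>();

    // Added data files: only position deletes from the same snapshot are looked up.
    for (DataFile f : added) {
      tasks.add(new Task("AddedRowsScanTask", f.path, matching(addedDeletes, f.partition, true)));
    }

    // Removed data files: keep historic deletes only for the affected partitions.
    Set<String> affected = new HashSet<>();
    for (DataFile f : removed) affected.add(f.partition);
    List<DeleteFile> pruned = new ArrayList<>();
    for (DeleteFile d : historicDeletes) if (affected.contains(d.partition)) pruned.add(d);
    for (DataFile f : removed) {
      tasks.add(new Task("DeletedDataFileScanTask", f.path, matching(pruned, f.partition, false)));
    }

    for (DeleteFile d : addedDeletes) {
      // Unresolved equality deletes are emitted as is, one task per delete file.
      if (d.kind == DeleteKind.EQUALITY && !resolveEqualityDeletes) {
        tasks.add(new Task("AddedEqualityDeletesScanTask", d.path, new ArrayList<>()));
        continue;
      }
      // Otherwise: one task per affected data file; historic deletes only if requested.
      for (DataFile f : existing) {
        if (!f.partition.equals(d.partition)) continue;
        List<String> deletes = new ArrayList<>();
        deletes.add(d.path);
        if (resolveDoubleDeletes) deletes.addAll(matching(historicDeletes, f.partition, false));
        tasks.add(new Task("DeletedRowsScanTask", f.path, deletes));
      }
    }
    return tasks;
  }

  // Small made-up snapshot used to exercise plan().
  static List<Task> demo(boolean resolveEqualityDeletes, boolean resolveDoubleDeletes) {
    return plan(
        Arrays.asList(new DataFile("a.parquet", "p=1")),
        Arrays.asList(new DataFile("r.parquet", "p=2")),
        Arrays.asList(new DataFile("e1.parquet", "p=1"), new DataFile("e2.parquet", "p=1"),
            new DataFile("e3.parquet", "p=3")),
        Arrays.asList(new DeleteFile("pos1", "p=1", DeleteKind.POSITION),
            new DeleteFile("eq1", "p=3", DeleteKind.EQUALITY)),
        Arrays.asList(new DeleteFile("h0", "p=1", DeleteKind.POSITION),
            new DeleteFile("h1", "p=2", DeleteKind.POSITION),
            new DeleteFile("h2", "p=9", DeleteKind.POSITION)),
        resolveEqualityDeletes, resolveDoubleDeletes);
  }

  public static void main(String[] args) {
    for (Task t : demo(false, false)) {
      System.out.println(t.type + " " + t.file + " " + t.deletes);
    }
  }
}
```

In this toy model the two flags only change the last loop: with both off, the equality delete file is passed through untouched and no historic deletes are attached to `DeletedRowsScanTask`; with both on, every added delete file is fanned out to its matching data files and the historic deletes come along so double deletes can be filtered at read time.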