dramaticlly commented on PR #9724: URL: https://github.com/apache/iceberg/pull/9724#issuecomment-1965684365
> Okay, I did one pass and here are my high-level notes:
>
> * We should use `RewriteFiles` instead of `DeleteFiles`, changes in `DeleteFiles` should be reverted.
> * I don't see a need for the enum to control the cleanup mode.
> * I'd consider having a separate action but I can be convinced otherwise. Especially, given that we may account for partition stats in the future.
> * I'd consider the following algorithm:
>
>   * Extend `data_files` and `delete_files` metadata tables to include data sequence numbers, if needed. I don't remember if we already populate them. This should be trivial as each `DeleteFile` object already has this info.
>   * Query `data_files`, aggregate, compute min data sequence number per partition. Don't cache the computed result, just keep a reference to it.
>   * Query `delete_files`, potentially projecting only strictly required columns.
>   * Join the summary with `delete_files` on the spec ID and partition. Find delete files that can be discarded in one go by having a predicate that accounts for the delete type (position vs equality).
>   * Collect the result to the driver and use `SparkDeleteFile` to wrap Spark rows as valid delete files. See the action for rewriting manifests for an example.

Based on Anton's feedback, I will try to divide the changes into two PRs. The first PR (#9813) adds support for the data sequence number in the `data_files` and `delete_files` metadata tables. Once it is merged, I will update this PR to scan `data_files` first and aggregate the min data sequence number per spec/partition, then compare the result against `delete_files`. With a left join, we can identify dangling deletes and remove them in one pass. `SparkDeleteFile` will be used to convert Spark rows to POJOs for pruning, which also takes partition evolution into account. Finally, dangling deletes will be removed using the reconstructed delete file objects instead of file paths, so that manifest pruning can help when the Iceberg table is scanned.
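To make the join predicate concrete, here is a minimal in-memory sketch of the plan above, written in plain Python rather than Spark. `FileEntry`, `min_data_seq_per_partition` and `find_dangling_deletes` are hypothetical names made up for illustration and are not part of the Iceberg API. The sketch assumes the usual delete application rules: a position delete applies to data files whose data sequence number is less than or equal to its own, while an equality delete only applies to data files with a strictly smaller one. It also assumes that a delete file in a partition with no live data files is dangling, which would not hold for global equality deletes written under an unpartitioned spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a row of the data_files / delete_files tables."""
    path: str
    spec_id: int
    partition: tuple
    data_sequence_number: int
    content: str = "data"  # "data", "position_deletes" or "equality_deletes"


def min_data_seq_per_partition(data_files):
    """Aggregate data files into {(spec_id, partition): min data sequence number}."""
    summary = {}
    for f in data_files:
        key = (f.spec_id, f.partition)
        current = summary.get(key, f.data_sequence_number)
        summary[key] = min(current, f.data_sequence_number)
    return summary


def find_dangling_deletes(data_files, delete_files):
    """Left-join delete files against the summary and keep the dangling ones."""
    summary = min_data_seq_per_partition(data_files)
    dangling = []
    for d in delete_files:
        # A missing key plays the role of a NULL on the right side of a left join.
        min_seq = summary.get((d.spec_id, d.partition))
        if min_seq is None:
            # Assumption of this sketch: no live data files means nothing to delete.
            dangling.append(d)
        elif d.content == "position_deletes" and d.data_sequence_number < min_seq:
            # Position deletes still apply when the sequence numbers are equal.
            dangling.append(d)
        elif d.content == "equality_deletes" and d.data_sequence_number <= min_seq:
            # Equality deletes only apply to strictly older data files.
            dangling.append(d)
    return dangling


day1, day2 = ("2024-01-01",), ("2024-01-02",)
data = [
    FileEntry("data-a.parquet", 0, day1, 5),
    FileEntry("data-b.parquet", 0, day1, 7),
]
deletes = [
    FileEntry("pos-old.parquet", 0, day1, 4, "position_deletes"),
    FileEntry("pos-same.parquet", 0, day1, 5, "position_deletes"),
    FileEntry("eq-same.parquet", 0, day1, 5, "equality_deletes"),
    FileEntry("eq-new.parquet", 0, day1, 6, "equality_deletes"),
    FileEntry("pos-orphan.parquet", 0, day2, 9, "position_deletes"),
]
dangling_paths = [d.path for d in find_dangling_deletes(data, deletes)]
print(dangling_paths)
```

In the sample data, the minimum data sequence number for the first partition is 5, so the position delete at 5 and the equality delete at 6 are kept, while the other three delete files are reported as dangling. In the actual action the same comparison would run as a Spark join over the metadata tables, with the matching rows wrapped via `SparkDeleteFile` before the commit.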