dramaticlly commented on PR #9724:
URL: https://github.com/apache/iceberg/pull/9724#issuecomment-1965684365

   > Okay, I did one pass and here are my high-level notes:
   > 
   > * We should use `RewriteFiles` instead of `DeleteFiles`, changes in `DeleteFiles` should be reverted.
   > * I don't see a need for the enum to control the cleanup mode.
   > * I'd consider having a separate action but I can be convinced otherwise. Especially, given that we may account for partition stats in the future.
   > * I'd consider the following algorithm:
   >   
   >   * Extend `data_files` and `delete_files` metadata tables to include data sequence numbers, if needed. I don't remember if we already populate them. This should be trivial as each `DeleteFile` object already has this info.
   >   * Query `data_files`, aggregate, compute min data sequence number per partition. Don't cache the computed result, just keep a reference to it.
   >   * Query `delete_files`, potentially projecting only strictly required columns.
   >   * Join the summary with `delete_files` on the spec ID and partition. Find delete files that can be discarded in one go by having a predicate that accounts for the delete type (position vs equality).
   >   * Collect the result to the driver and use `SparkDeleteFile` to wrap Spark rows as valid delete files. See the action for rewriting manifests for an example.
   
   Based on Anton's feedback, I will try to split the changes into two PRs. The 
first PR (#9813) adds support for data sequence numbers in the `data_files` and 
`delete_files` metadata tables. Once it is merged, I will update this PR to scan 
`data_files` first and aggregate the min data sequence number per spec/partition, 
then compare the result against `delete_files`. With a left join, we can identify 
dangling deletes and remove them in one pass. `SparkDeleteFile` will be used to 
convert each Spark row to a POJO for pruning, taking partition evolution into 
account. Finally, dangling deletes will be removed by reconstructed delete file 
object instead of by file path, to benefit from manifest pruning when the Iceberg 
table is scanned.
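
To make the planned join concrete, here is a minimal sketch in plain Python (not Spark, and not the code in this PR). The tuple types, field names, and function names are all hypothetical. The predicate assumes the usual Iceberg rule that a position delete applies to data files with a data sequence number less than or equal to its own, while an equality delete applies only to data files with a strictly smaller one. It also assumes that a delete file with no matching data files in its spec/partition is dangling, which would not hold for global equality deletes written under an unpartitioned spec.

```python
from collections import namedtuple

# Hypothetical stand-ins for rows of the data_files / delete_files metadata tables.
DataFile = namedtuple("DataFile", "spec_id partition data_seq")
DeleteFile = namedtuple("DeleteFile", "path content spec_id partition data_seq")

POSITION, EQUALITY = "position", "equality"


def min_data_seq_per_partition(data_files):
    """Aggregate data files into {(spec_id, partition): min data sequence number}."""
    mins = {}
    for f in data_files:
        key = (f.spec_id, f.partition)
        mins[key] = min(mins.get(key, f.data_seq), f.data_seq)
    return mins


def find_dangling_deletes(data_files, delete_files):
    """Left-join delete files to the per-partition summary and keep the dangling ones."""
    mins = min_data_seq_per_partition(data_files)
    dangling = []
    for d in delete_files:
        # None plays the role of the unmatched side of a left join.
        min_seq = mins.get((d.spec_id, d.partition))
        if min_seq is None:
            # Assumption: no data files left in this spec/partition.
            dangling.append(d)
        elif d.content == POSITION and d.data_seq < min_seq:
            # Position deletes only affect data files with seq <= delete seq.
            dangling.append(d)
        elif d.content == EQUALITY and d.data_seq <= min_seq:
            # Equality deletes only affect data files with seq < delete seq.
            dangling.append(d)
    return dangling
```

Joining on both the spec ID and the partition tuple keeps files written under different partition specs apart, which is the partition-evolution concern mentioned above.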


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

