Anton-Tarazi opened a new issue, #2604: URL: https://github.com/apache/iceberg-python/issues/2604
### Feature Request / Improvement

Running an expire snapshots operation only rewrites the metadata file without the expired snapshots (and their refs/statistics). It does not delete the deleted data files that are referenced only by the expired snapshots. This can be observed by deleting an entire table and then calling `expire_snapshots`: the data files still exist.

Trino and Spark both clean up deleted data files once all snapshots referencing them are expired.

From the spec:

```
When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
...
1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires. It is not recommended to add a deleted file back to a table. Adding a deleted file can lead to edge cases where incremental deletes can break table snapshots.
```

Happy to work on this if others agree that this should be added :)
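For discussion, here is a rough sketch of the rule quoted from the spec, written against a simplified in-memory model rather than PyIceberg's actual classes. `ManifestEntry`, `files_safe_to_delete`, and the `history` mapping are hypothetical names invented for this illustration. The sketch assumes we can list, for each snapshot, every manifest entry reachable from it. A file becomes a candidate when an expired snapshot recorded it with status 2 (deleted), and it is dropped from the candidates if any retained snapshot still holds it as live data.

```python
from dataclasses import dataclass

DELETED = 2  # manifest entry status for a removed file, as quoted from the spec


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical, simplified stand-in for a manifest entry."""
    snapshot_id: int  # snapshot in which the file was added or deleted
    status: int       # 2 means deleted; any other value is treated as live
    file_path: str


def files_safe_to_delete(entries_by_snapshot, expired_ids):
    """Return data file paths that could be removed once `expired_ids` expire.

    `entries_by_snapshot` maps a snapshot ID to all manifest entries
    reachable from that snapshot (an assumption of this sketch).
    """
    expired_ids = set(expired_ids)

    # Files that an expired snapshot itself marked as deleted.
    candidates = {
        entry.file_path
        for snapshot_id in expired_ids
        for entry in entries_by_snapshot.get(snapshot_id, [])
        if entry.status == DELETED and entry.snapshot_id in expired_ids
    }

    # Files still live in a snapshot that is being kept.
    still_live = {
        entry.file_path
        for snapshot_id, entries in entries_by_snapshot.items()
        if snapshot_id not in expired_ids
        for entry in entries
        if entry.status != DELETED
    }

    return sorted(candidates - still_live)


# Snapshot 1 adds two files, snapshot 2 deletes a.parquet,
# snapshot 3 deletes b.parquet and is the current snapshot.
history = {
    1: [ManifestEntry(1, 1, "a.parquet"), ManifestEntry(1, 1, "b.parquet")],
    2: [ManifestEntry(2, DELETED, "a.parquet"), ManifestEntry(1, 0, "b.parquet")],
    3: [ManifestEntry(3, DELETED, "b.parquet")],
}

print(files_safe_to_delete(history, expired_ids={1, 2}))
```

Subtracting the files still live in retained snapshots is a conservative choice: it covers the "older snapshots have also been garbage collected" condition and the edge case of a deleted file being added back. A real implementation would also need to handle delete files, manifests, and manifest lists, which this sketch ignores.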
