szehon-ho opened a new issue, #10646: URL: https://github.com/apache/iceberg/issues/10646
### Proposed Change

**Motivation**

Currently, a snapshot's lifecycle is handled by 'ExpireSnapshots(long olderThan)'. This operation does the following:

- Choose a set of snapshots to expire based on timestamp
- Remove the association of these Snapshots from TableMetadata
- Purge the metadata of these Snapshots
- Purge the data files of these Snapshots that are not referenced by non-expired snapshots (i.e., data files that were deleted from the table before the olderThan timestamp)

Purging deleted data often requires an aggressive timeline, due to strict requirements to claw back unused disk space, fulfill data lifecycle compliance, etc. In many deployments, this means the 'olderThan' timestamp is set to just a few days before the current time (the default is 5 days).

On the other hand, metadata would ideally be purged on a more relaxed timeline, to allow for meaningful historical table analysis. This could be months, or years.

But today the two are purged together, so if we choose an aggressive olderThan timestamp in order to purge deleted Snapshot data, we cannot preserve just the Snapshot metadata.

**Implementation Summary**

Add an additional field to snapshot metadata:

| v3 | Field | Description |
| -- | -- | -- |
| optional | expired | Whether this snapshot has been expired but not purged. Defaults to false |

In the reference implementation, improve ExpireSnapshots (Core, Spark) to take another parameter:

```
/**
 * Whether to purge Snapshot metadata on expiry.
 * Pass false to maintain the Snapshot metadata after expiry.
 */
ExpireSnapshots.purge(purge = true)
```

ExpireSnapshots will continue to purge deleted data files for the Snapshots chosen for expiration, as it does today. But now, if purge == false, the Snapshot metadata is maintained and TableMetadata keeps the Snapshot reference (with the expired flag set to true on those Snapshots).
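As a rough illustration of the semantics described above, here is a small self-contained Java sketch. It is not the Iceberg API: the classes `SnapshotExpirySketch`, `Snapshot`, and `Result`, the `expire` and `files` helpers, and the use of plain strings for data files are all hypothetical, and the sketch assumes snapshots are selected purely by timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the proposed two-phase expiry. NOT the Iceberg API;
// every class, field, and method name here is an assumption for illustration.
class SnapshotExpirySketch {

  // A snapshot carrying the proposed optional 'expired' flag (defaults to false).
  static final class Snapshot {
    final long id;
    final long timestampMillis;
    final Set<String> dataFiles;
    final boolean expired;

    Snapshot(long id, long timestampMillis, Set<String> dataFiles, boolean expired) {
      this.id = id;
      this.timestampMillis = timestampMillis;
      this.dataFiles = dataFiles;
      this.expired = expired;
    }

    Snapshot(long id, long timestampMillis, Set<String> dataFiles) {
      this(id, timestampMillis, dataFiles, false);
    }
  }

  // Outcome of one expiry pass: snapshots still listed, and data files purged.
  static final class Result {
    final List<Snapshot> snapshots;
    final Set<String> purgedDataFiles;

    Result(List<Snapshot> snapshots, Set<String> purgedDataFiles) {
      this.snapshots = snapshots;
      this.purgedDataFiles = purgedDataFiles;
    }
  }

  static Set<String> files(String... names) {
    return new HashSet<>(Arrays.asList(names));
  }

  static Result expire(List<Snapshot> snapshots, long olderThanMillis, boolean purge) {
    // Data files referenced by snapshots that stay live must not be purged.
    Set<String> liveFiles = new HashSet<>();
    for (Snapshot s : snapshots) {
      if (!s.expired && s.timestampMillis >= olderThanMillis) {
        liveFiles.addAll(s.dataFiles);
      }
    }

    List<Snapshot> kept = new ArrayList<>();
    Set<String> purged = new TreeSet<>();
    for (Snapshot s : snapshots) {
      if (s.timestampMillis >= olderThanMillis) {
        kept.add(s); // not selected for expiry; left unchanged
        continue;
      }
      if (!s.expired) {
        // Data files are purged in both modes, as ExpireSnapshots does today.
        for (String f : s.dataFiles) {
          if (!liveFiles.contains(f)) {
            purged.add(f);
          }
        }
      }
      if (!purge) {
        // purge == false: keep the snapshot listed, flagged as expired.
        kept.add(new Snapshot(s.id, s.timestampMillis, s.dataFiles, true));
      }
      // purge == true: the snapshot is dropped from the list entirely.
    }
    return new Result(kept, purged);
  }

  public static void main(String[] args) {
    List<Snapshot> table = new ArrayList<>();
    table.add(new Snapshot(1L, 100L, files("a", "b")));
    table.add(new Snapshot(2L, 200L, files("b", "c")));
    table.add(new Snapshot(3L, 300L, files("c", "d")));

    Result first = expire(table, 250L, false);           // aggressive data purge
    Result second = expire(first.snapshots, 250L, true); // later metadata purge
    System.out.println(first.snapshots.size() + " listed, purged " + first.purgedDataFiles);
    System.out.println(second.snapshots.size() + " listed, purged " + second.purgedDataFiles);
  }
}
```

In this toy run the first call keeps all three snapshots listed (two of them flagged expired) while purging files `a` and `b`; the second call, with purge == true, leaves only snapshot 3 and purges no further data files.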
These 'expired' but un-purged Snapshots can then be disassociated from TableMetadata by a later call to ExpireSnapshots with purge == true, which also purges their metadata.

Expired but un-purged Snapshots behave as if removed, and cannot be the target of rollback or time-travel operations, because the data files they refer to may already have been purged by the ExpireSnapshots operation. They do, however, still show up in the TableMetadata's list of snapshots, marked by the 'expired' flag. Their metadata can also show up in the 'manifests' and 'files' metadata tables, likewise marked with an 'expired' flag.

### Proposal document

https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit

### Specifications

- [X] Table
- [ ] View
- [ ] REST
- [ ] Puffin
- [ ] Encryption
- [ ] Other