szehon-ho opened a new issue, #10646:
URL: https://github.com/apache/iceberg/issues/10646

   ### Proposed Change
   
   **Motivation**
   
   Currently, a snapshot's lifecycle is handled by 'ExpireSnapshots(long 
olderThan)'.  This operation does the following:
   
   - Choose a set of snapshots to expire based on timestamp
   - Remove association of these Snapshots from TableMetadata
   - Purge metadata of these Snapshots
   - Purge data files of these Snapshots that are not referred to by 
non-expired snapshots (i.e., data files that were deleted from the table 
before the olderThan timestamp).
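
The steps above can be sketched roughly as follows. This is a simplified, self-contained model and not the Iceberg API: `ExpirySketch`, `Snap`, `Result`, and `expire` are hypothetical names, and it ignores details such as branch/tag references and manifest files. It only illustrates how the set of data files to purge follows from the set of snapshots that are retained.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of today's behavior; not the Iceberg API.
final class ExpirySketch {
    // A snapshot reduced to an id, a commit timestamp, and the data files it references.
    record Snap(long id, long timestampMillis, Set<String> dataFiles) {}

    // Outcome of one expiry run: snapshots still listed, and data files safe to delete.
    record Result(List<Snap> retained, Set<String> filesToDelete) {}

    static Result expire(List<Snap> snapshots, long olderThan) {
        List<Snap> retained = new ArrayList<>();
        Set<String> candidates = new HashSet<>();
        for (Snap s : snapshots) {
            if (s.timestampMillis() < olderThan) {
                // Chosen for expiration: its files become deletion candidates.
                candidates.addAll(s.dataFiles());
            } else {
                retained.add(s);
            }
        }
        // Files still referenced by a retained snapshot must survive.
        for (Snap s : retained) {
            candidates.removeAll(s.dataFiles());
        }
        return new Result(retained, candidates);
    }

    public static void main(String[] args) {
        List<Snap> snapshots = List.of(
            new Snap(1L, 100L, Set.of("a.parquet", "b.parquet")),
            new Snap(2L, 200L, Set.of("b.parquet", "c.parquet")));
        Result result = expire(snapshots, 150L);
        // Only a.parquet is unreferenced once snapshot 1 is expired.
        System.out.println(result.filesToDelete());
    }
}
```

In this model the snapshot entry and its data files disappear in the same call, which is the coupling the rest of this proposal aims to loosen.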
   
   Purging deleted data often requires a more aggressive timeline, due to 
strict requirements to claw back unused disk space, fulfill data lifecycle 
compliance, etc.  In many deployments, this means 'olderThan' timestamp is set 
to just a few days before the current time (the default is 5 days).
   
   On the other hand, purging metadata is ideally done on a more relaxed 
timeline, to allow for meaningful historical table analysis.  This could 
be months or years.
   
   But today, the two are purged together, so we cannot preserve just the 
Snapshot metadata if we choose an aggressive olderThan timestamp for the 
purpose of purging deleted Snapshot data.
   
   
   **Implementation Summary**
   
   Add an additional field to snapshot metadata:

   v3 | Field | Description
   -- | -- | --
   optional | expired | Whether this snapshot has been expired but not purged. Defaults to false
   
   In the reference implementation, improve ExpireSnapshots (Core, Spark) to 
take another parameter:
   
   ```
   /**
    * Whether to purge Snapshot metadata on expiry.  If false, the Snapshot
    * metadata is maintained and the Snapshots are marked as expired.
    */
   ExpireSnapshots.purge(purge = true)
   ```
   
   ExpireSnapshots will continue to purge deleted data files for the Snapshots 
chosen for expiration as it does today.  But now, if purge == false, the 
Snapshot metadata is maintained, and TableMetadata maintains the Snapshot 
reference (with the expired flag set to true on the Snapshots).
   
   These 'expired' but un-purged Snapshots can then be disassociated from 
TableMetadata later by another call to ExpireSnapshots with purge == true, 
which will also purge their metadata.
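
As an illustration of the two-phase flow described above, here is a minimal sketch. It is a toy model, not the reference implementation: `TwoPhaseExpirySketch`, `Snap`, and the `expire` signature are hypothetical, and the deletion of data files (which would happen in both modes) is only indicated by a comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the proposed behavior; not the Iceberg API.
final class TwoPhaseExpirySketch {
    // 'expired' stands in for the proposed optional v3 field (defaults to false).
    record Snap(long id, long timestampMillis, boolean expired) {}

    // Returns the snapshots that remain listed in table metadata after the call.
    static List<Snap> expire(List<Snap> snapshots, long olderThan, boolean purge) {
        List<Snap> listed = new ArrayList<>();
        for (Snap s : snapshots) {
            if (s.timestampMillis() >= olderThan) {
                listed.add(s); // not chosen for expiration, left untouched
                continue;
            }
            // Deleted data files of this snapshot would be purged here in both modes.
            if (!purge) {
                // Keep the snapshot metadata, but mark it as expired.
                listed.add(new Snap(s.id(), s.timestampMillis(), true));
            }
            // purge == true: the snapshot is dropped along with its metadata.
        }
        return listed;
    }

    public static void main(String[] args) {
        List<Snap> snapshots = List.of(new Snap(1L, 100L, false), new Snap(2L, 200L, false));
        // Phase 1: aggressive cutoff, metadata kept.
        List<Snap> afterFirst = expire(snapshots, 150L, false);
        // Phase 2 (possibly much later): drop the expired snapshot's metadata too.
        List<Snap> afterSecond = expire(afterFirst, 150L, true);
        System.out.println(afterFirst);
        System.out.println(afterSecond);
    }
}
```

In practice the second call would likely use a much older cutoff than the first, so that metadata outlives data by months or years; the sketch reuses the same cutoff only to keep the example short.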
   
   Expired but un-purged Snapshots behave as if effectively removed, and cannot 
be the target of rollback or time-travel operations.  This is because the data 
files they refer to may have been purged by the ExpireSnapshots operation.  They 
will, however, show up in the TableMetadata's list of snapshots, marked by the 
'expired' flag.  Their metadata can also show up in the 'manifests' and 'files' 
metadata tables, also marked with an 'expired' flag.
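
One way to picture the rollback and time-travel restriction is a guard applied when a snapshot ID is resolved for reading. The sketch below is an assumption about how such a check could look, not part of the proposal text: `TimeTravelGuardSketch`, `Snap`, and `resolveForRead` are hypothetical names, and the error messages are illustrative.

```java
import java.util.Map;

// Hypothetical guard; not the Iceberg API.
final class TimeTravelGuardSketch {
    record Snap(long id, boolean expired) {}

    // Resolves a snapshot for time travel or rollback, rejecting expired ones.
    static Snap resolveForRead(Map<Long, Snap> snapshotsById, long snapshotId) {
        Snap s = snapshotsById.get(snapshotId);
        if (s == null) {
            throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
        }
        if (s.expired()) {
            // Still listed in metadata, but its data files may already be gone.
            throw new IllegalArgumentException("Snapshot is expired: " + snapshotId);
        }
        return s;
    }

    public static void main(String[] args) {
        Map<Long, Snap> snapshots = Map.of(1L, new Snap(1L, true), 2L, new Snap(2L, false));
        System.out.println(resolveForRead(snapshots, 2L).id());
        try {
            resolveForRead(snapshots, 1L);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Listing operations, such as the snapshot list or the metadata tables, would skip this guard and simply surface the flag.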
   
   
   ### Proposal document
   
   
https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
   
   ### Specifications
   
   - [X] Table
   - [ ] View
   - [ ] REST
   - [ ] Puffin
   - [ ] Encryption
   - [ ] Other


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

