toien commented on issue #10907:
URL: https://github.com/apache/iceberg/issues/10907#issuecomment-2282824993

   The number of snapshots increased because the Flink job was still writing data to the table.
   
   In my opinion, the
   [doc](https://iceberg.apache.org/docs/1.4.3/spark-procedures/#expire_snapshots)
   should clarify that the `retain_last` parameter acts as a "minimum":
 
   
   >Number of ancestor snapshots to preserve regardless of `older_than`. 
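
   To illustrate that "minimum" behaviour, here is a simplified Python model of how `older_than` and `retain_last` interact. It is only a sketch: it assumes a linear snapshot history with no branches or tags, and `snapshots_to_expire` is a hypothetical helper, not part of Iceberg.

```python
from datetime import datetime


def snapshots_to_expire(timestamps, older_than, retain_last=1):
    """Return the snapshot timestamps that would be expired.

    Simplified model: a linear history, no branches or tags.
    """
    ordered = sorted(timestamps)
    # The newest `retain_last` snapshots are always kept, whatever their age.
    always_kept = set(ordered[-retain_last:]) if retain_last > 0 else set()
    return [ts for ts in ordered if ts < older_than and ts not in always_kept]


# Five hourly snapshots; every one of them is older than the cutoff.
history = [datetime(2024, 8, 11, hour) for hour in range(5)]
cutoff = datetime(2024, 8, 12)

expired = snapshots_to_expire(history, cutoff, retain_last=2)
print(len(expired))  # 3: the two newest snapshots survive the cutoff
```

   So `retain_last` does not cap the number of snapshots. Snapshots newer than `older_than` are kept as well, which is why the count keeps growing while a job is still writing.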
   
   ### Summary
   
   After doing some tests, I am finally starting to understand Iceberg's maintenance
procedures. I hope this helps people who are new to Iceberg, like me.
   
   #### `rewrite_data_files`
   `rewrite_data_files` reads the small source files, compacts them, 
and writes new, larger files. It **won't** delete the old small files.
   
   Data files are the leaf level of the Iceberg table layout, and they belong to manifest 
files. Deleting the small source files would break the manifest files that still reference them.
   
   This procedure optimizes data files (usually by merging them) and creates a new 
version (snapshot) of the table.
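
   A minimal sketch of invoking this procedure from PySpark. The catalog `my_catalog` and the table `db.events` are hypothetical names; the helper only builds the `CALL` statement, and running it needs a Spark session with the Iceberg runtime and catalog configured.

```python
def rewrite_data_files_sql(catalog: str, table: str) -> str:
    """Build the Spark SQL CALL statement for Iceberg's rewrite_data_files."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


# Hypothetical catalog and table names, used only for illustration.
stmt = rewrite_data_files_sql("my_catalog", "db.events")
print(stmt)
# With a configured session: spark.sql(stmt).show()
```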
   
   #### `rewrite_manifests`
   Unlike `rewrite_data_files`, `rewrite_manifests` replaces the old manifest files.
   
   This procedure optimizes manifest files (usually by merging them) and creates a 
new version (snapshot) of the table.
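
   One way to observe the effect is to count the rows of the `manifests` metadata table before and after the call. The sketch below only builds the two statements; `my_catalog` and `db.events` are hypothetical names.

```python
def rewrite_manifests_sql(catalog: str, table: str) -> str:
    """Build the CALL statement for Iceberg's rewrite_manifests procedure."""
    return f"CALL {catalog}.system.rewrite_manifests(table => '{table}')"


def manifest_count_sql(catalog: str, table: str) -> str:
    """Count the manifests referenced by the table's current snapshot."""
    return f"SELECT count(*) FROM {catalog}.{table}.manifests"


# Hypothetical names; run the count, then the CALL, then the count again.
count_stmt = manifest_count_sql("my_catalog", "db.events")
call_stmt = rewrite_manifests_sql("my_catalog", "db.events")
print(count_stmt)
print(call_stmt)
```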
   
   #### `expire_snapshots` 
   Always use the `older_than` parameter. 
   
   If data files that you expected to be deleted still remain in S3 or HDFS, recheck 
the metadata tables after executing the procedure. The files may still be *linked* in the `manifests` or 
`entries` metadata tables.
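
   A sketch of the call and of the follow-up check, assuming the hypothetical names `my_catalog` and `db.events`. The cutoff and the `retain_last` value are illustrative; the second helper lists what the `entries` metadata table still references.

```python
from datetime import datetime


def expire_snapshots_sql(catalog, table, older_than, retain_last):
    """Build the CALL statement for Iceberg's expire_snapshots procedure."""
    cutoff = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})"
    )


def entries_sql(catalog, table):
    """List the data files still referenced by manifest entries."""
    return (
        "SELECT status, snapshot_id, data_file.file_path "
        f"FROM {catalog}.{table}.entries"
    )


# Hypothetical names and cutoff, used only for illustration.
expire_stmt = expire_snapshots_sql("my_catalog", "db.events", datetime(2024, 8, 10), 5)
check_stmt = entries_sql("my_catalog", "db.events")
print(expire_stmt)
print(check_stmt)
```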
   
   #### Maintenance tips
   
   Say we have a table that Flink jobs upsert into, which creates a lot of data 
files and metadata. Executing these procedures hourly would keep the Iceberg table optimized:
   
   - `rewrite_data_files`
   - `rewrite_manifests`
   
   For a partitioned table, say one partitioned by day:
   
   - Execute the optimizing rewrite procedures hourly on the active partition. 
   - Execute `expire_snapshots` daily on old partitions (this is a one-time 
job).
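
   The hourly step could be sketched as follows. It assumes a hypothetical day partition column named `dt` and the hypothetical names `my_catalog` and `db.events`; the `where` argument of `rewrite_data_files` restricts compaction to the active partition.

```python
from datetime import date


def hourly_maintenance_sql(catalog: str, table: str, active_day: date) -> list:
    """Build the hourly statements for the partition that is being written."""
    day = active_day.isoformat()
    return [
        # Compact only the data files of the active day partition.
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', where => 'dt = \"{day}\"')",
        # Manifests are rewritten for the table as a whole.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]


# Hypothetical names and date, used only for illustration.
statements = hourly_maintenance_sql("my_catalog", "db.events", date(2024, 8, 11))
for stmt in statements:
    print(stmt)
```

   A scheduler (cron, Airflow, or similar) would pass each statement to `spark.sql(...)` once per hour.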

