toien commented on issue #10907: URL: https://github.com/apache/iceberg/issues/10907#issuecomment-2282824993
The number of snapshots increased because the Flink job was still writing data to the table.

In my opinion, it would be better to clarify in the [doc](https://iceberg.apache.org/docs/1.4.3/spark-procedures/#expire_snapshots) that the `retain_last` parameter acts as a "minimum":

> Number of ancestor snapshots to preserve regardless of `older_than`.

### Summary

After doing some tests, I am finally starting to understand Iceberg's maintenance procedures. I hope this helps people who are new to Iceberg, like me.

#### `rewrite_data_files`

`rewrite_data_files` is a procedure that reads the small source files, compacts them, and writes new files. It **won't** delete the old small files. Data files are the leaf level of an Iceberg table and belong to manifest files, so deleting the source small files would break the manifest files that reference them.

This procedure optimizes data files (usually by merging them) and creates a new version (snapshot) of the table.

#### `rewrite_manifests`

Unlike with data files, `rewrite_manifests` replaces the old manifest files.

This procedure optimizes manifest files (usually by merging them) and creates a new version (snapshot) of the table.

#### `expire_snapshots`

Always use the `older_than` parameter.

If data files that you expected to be deleted still remain in S3 or HDFS, recheck the metadata tables after executing the procedure. The files may still be *linked* in manifests or entries.

#### Maintenance tips

Say we have a table that Flink jobs upsert into, which creates a lot of data files and metadata. Running these procedures hourly would optimize the Iceberg table:

- `rewrite_data_files`
- `rewrite_manifests`

For a partitioned table, say one partitioned by day:

- Hourly: run the optimizing rewrite procedures on the active partition.
- Daily: run `expire_snapshots` on old partitions (this is a one-time job).

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
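As a rough illustration of the maintenance tips in the comment above, here is a minimal sketch that only builds the Spark SQL `CALL` statements as strings. The catalog name, table name, partition column, and default retention values are all hypothetical and not taken from the comment. As far as I know, only `rewrite_data_files` accepts a `where` filter, so the other two calls are issued against the whole table here; that is an assumption worth checking against the docs for your Iceberg version.

```python
from datetime import datetime, timedelta

# Hypothetical names for illustration only; none of them come from the comment.
CATALOG = "my_catalog"
TABLE = "db.events"
PARTITION_COLUMN = "event_day"  # assumed day-partition column


def hourly_statements(active_day: str) -> list[str]:
    """Build the hourly rewrite calls, compacting only the active day partition."""
    return [
        # Restrict compaction to the active partition via the `where` filter.
        f"CALL {CATALOG}.system.rewrite_data_files("
        f"table => '{TABLE}', "
        f"where => '{PARTITION_COLUMN} = \"{active_day}\"')",
        # Assumption: manifests are rewritten for the whole table.
        f"CALL {CATALOG}.system.rewrite_manifests(table => '{TABLE}')",
    ]


def daily_statement(now: datetime, keep_days: int = 1, retain_last: int = 5) -> str:
    """Build the daily expire call; `retain_last` is a minimum number to keep."""
    cutoff = (now - timedelta(days=keep_days)).strftime("%Y-%m-%d %H:%M:%S.000")
    return (
        f"CALL {CATALOG}.system.expire_snapshots("
        f"table => '{TABLE}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})"
    )


if __name__ == "__main__":
    for statement in hourly_statements("2024-08-12"):
        print(statement)
    print(daily_statement(datetime(2024, 8, 12, 3, 0, 0)))
```

In a Spark session configured with the Iceberg SQL extensions, each string could be passed to `spark.sql(...)` from whatever scheduler runs the hourly and daily jobs.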
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org