[I] How can we better collect and manage Iceberg table metadata? [iceberg]

via GitHub Fri, 13 Jun 2025 18:10:37 -0700


slfan1989 opened a new issue, #13311:
URL: https://github.com/apache/iceberg/issues/13311


   ### Query engine
   
   Spark, Flink
   
   ### Question
   
   Is there any tool or recommended practice for collecting and managing 
metadata for all Iceberg tables in a centralized way?
   
   # Iceberg brings us several key advantages:
   
   - Partition management outside of HMS: This significantly reduces the load 
on the Hive Metastore and helps avoid frequent Full GC issues.
   
   - Comprehensive predicate pushdown: Iceberg can push filters down on nearly all fields, not only partition columns, because it tracks per-column statistics for each data file. This reduces the amount of data scanned and greatly improves query performance.
   
   - Efficient storage: Using Parquet with ZSTD compression helps us reduce 
overall storage costs.
   
   # Business background:
   
   Our platform generates a large volume of data every day, but storage resources are limited, so users often define TTL (time-to-live) rules to automatically clean up expired data.
   
   ## Previously with Hive tables:
   
   - Metadata was stored in HMS.
   - We regularly synced the `DBS` and `PARTITIONS` tables from MySQL into a Hive table.
   - Based on the partition creation time, we determined whether data had reached its TTL.
   - If a partition had expired:
     - For managed tables, we executed `DROP PARTITION`.
     - For external tables, we first deleted the data from HDFS, then called `DROP PARTITION` to clean up the metadata.
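
For illustration, the TTL check in the Hive workflow can be sketched roughly as follows. This is a minimal sketch, not our production code: the function names `expired_partitions` and `drop_partition_sql` are hypothetical, the input is assumed to be `(PART_NAME, CREATE_TIME)` pairs taken from the synced `PARTITIONS` table with `CREATE_TIME` in epoch seconds, and partition values are assumed to be plain strings without escaped characters. Deleting the HDFS data for external tables is not shown.

```python
from datetime import datetime, timedelta, timezone


def expired_partitions(partitions, ttl_days, now=None):
    """Return the PART_NAME of every partition created more than ttl_days ago.

    `partitions` is an iterable of (part_name, create_time) pairs, where
    create_time is in epoch seconds (assumed to mirror PARTITIONS.CREATE_TIME).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=ttl_days)).timestamp()
    return [name for name, created in partitions if created < cutoff]


def drop_partition_sql(db, table, part_name):
    """Build a DROP PARTITION statement from a name like 'dt=2025-01-01/hour=00'.

    Assumption: every partition value can be quoted as a string literal.
    """
    spec = ", ".join(
        "{}='{}'".format(*kv.split("=", 1)) for kv in part_name.split("/")
    )
    return f"ALTER TABLE {db}.{table} DROP IF EXISTS PARTITION ({spec})"
```

The generated statements would then be run through Hive or Spark SQL, one per expired partition.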
   
   ## Now with Iceberg tables:
   
   Each table’s metadata is stored in HDFS (with plans to migrate to S3 in the 
future);
   
   We have written a custom program to manage this, which:
   
   - Lists all Iceberg tables by querying `DBS` from HMS;
   - Locates each table’s `metadata.json` file;
   - Uses Iceberg APIs to read `PartitionData` and extract partition information.
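
To make the second step more concrete, here is a minimal sketch of what can be read directly from a table's `metadata.json` once it has been located. It uses only the Python standard library and the field names from the Iceberg table metadata format; the function name `summarize_table_metadata` is hypothetical. It does not read per-partition values, because those live in the Avro manifest files referenced by the manifest list, which is why the Iceberg APIs are still needed for the last step.

```python
import json


def summarize_table_metadata(metadata_json):
    """Extract the current snapshot and default partition spec from table metadata.

    `metadata_json` is the text content of a table's metadata.json file.
    """
    meta = json.loads(metadata_json)
    current_id = meta.get("current-snapshot-id")

    # The current snapshot may be missing for a table with no data yet.
    snapshot = next(
        (s for s in meta.get("snapshots", []) if s["snapshot-id"] == current_id),
        None,
    )
    # Fall back to an empty spec if the default spec cannot be found.
    spec = next(
        (
            s
            for s in meta.get("partition-specs", [])
            if s["spec-id"] == meta.get("default-spec-id")
        ),
        {"fields": []},
    )
    return {
        "location": meta.get("location"),
        "snapshot_timestamp_ms": snapshot["timestamp-ms"] if snapshot else None,
        "manifest_list": snapshot["manifest-list"] if snapshot else None,
        "partition_fields": [(f["name"], f["transform"]) for f in spec["fields"]],
    }
```

Collecting this summary for every table gives a central inventory of locations, partition specs, and last-commit times, but TTL decisions per partition still require reading the manifests.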
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


