ZachDischner opened a new issue, #7892:
URL: https://github.com/apache/iceberg/issues/7892

   ### Feature Request / Improvement
   
   I wish to inspect the `catalog.db.table.partitions` metadata table, but cannot do so for large tables. Even on extremely large clusters I receive out-of-memory errors. This happens for tables with as few as 4 million partitions.
   
   **Using the `partitions` table directly**
   
   The query times out or fails with out-of-memory errors. The Spark UI shows that only one task is allocated, so the `partitions` table does not appear to be treated as a big-data problem.
   ```
   spark.read.format("iceberg").load("catalog.db.table.partitions").count
   ```
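   
   For reference, the same single-task behavior can be reproduced through Spark SQL as well (a sketch; it assumes `catalog` is registered as a Spark catalog in the session):
   ```
   spark.sql("SELECT count(*) FROM catalog.db.table.partitions").show
   ```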
   
   **Indirectly obtaining `partitions` information via the `files` metadata table**
   
   Inspecting the `files` metadata table is a sufficient workaround: `files` is treated as a big-data problem, so the scan parallelizes well.
   ```
   spark.read.format("iceberg").load("catalog.db.table.files")
     .agg(
       count("*").as("FileCount"),
       count_distinct(col("partition")).as("PartitionCount"))
     .show
   +---------+--------------+
   |FileCount|PartitionCount|
   +---------+--------------+
   |  4773395|       4302859|
   +---------+--------------+
   ```
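   
   For per-partition detail rather than just a distinct count, the same `files`-based approach can approximate the `partitions` table. A minimal sketch, assuming the standard `partition` and `record_count` columns of the `files` metadata table:
   ```
   // Derive per-partition stats from the files metadata table,
   // which Spark scans with many parallel tasks
   import org.apache.spark.sql.functions._
   
   spark.read.format("iceberg").load("catalog.db.table.files")
     .groupBy(col("partition"))
     .agg(
       count("*").as("file_count"),
       sum(col("record_count")).as("record_count"))
     .show(false)
   ```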
   
   ### Query engine
   
   Spark

