ZachDischner opened a new issue, #7892:
URL: https://github.com/apache/iceberg/issues/7892
### Feature Request / Improvement
I wish to inspect the `catalog.db.table.partitions` metadata table, but
cannot do so for large tables. Even on extremely large clusters I receive
out-of-memory errors. This happens for tables with as few as 4 million partitions.
**Using `partitions` table directly**
The query times out and fails with out-of-memory errors. The Spark UI shows
only one task allocated, so the `partitions` scan does not appear to be
treated as a big-data problem.
```
spark.read.format("iceberg").load("catalog.db.table.partitions").count
```
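For reference, the same scan can be expressed through Spark SQL (assuming a
session configured with an Iceberg catalog); it presumably hits the same
single-task code path:
```
spark.sql("SELECT count(*) FROM catalog.db.table.partitions").show
```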
**Indirectly obtaining `partitions` information via `files` metadata table**
Inspecting the `files` metadata table is a workable alternative. The `files`
table is treated as a big-data problem, so the scan parallelizes across the
cluster:
```
spark.read.format("iceberg").load("catalog.db.table.files").agg(count("*").as("FileCount"),
count_distinct(col("partition"))).as("PartitionCount").show
+---------+----------------+
|FileCount|count(partition)|
+---------+----------------+
| 4773395| 4302859|
+---------+----------------+
```
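Going further, a per-partition summary can also be derived from `files` by
grouping on its `partition` column. A minimal sketch, assuming the
`record_count` column of the `files` table; the output aliases `file_count`
and `record_count` are illustrative:
```
import org.apache.spark.sql.functions._

// Group the parallel `files` scan by partition to approximate the
// per-partition stats the `partitions` metadata table would report.
spark.read.format("iceberg").load("catalog.db.table.files")
  .groupBy("partition")
  .agg(count("*").as("file_count"), sum("record_count").as("record_count"))
  .orderBy(desc("file_count"))
  .show(20, false)
```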
### Query engine
Spark