Quanlong Huang created IMPALA-12962:
---------------------------------------
Summary: Estimated metadata size of a table doesn't match the
actual Java object size
Key: IMPALA-12962
URL: https://issues.apache.org/jira/browse/IMPALA-12962
Project: IMPALA
Issue Type: Bug
Components: Catalog
Reporter: Quanlong Huang
Catalogd shows the top-25 largest tables in its WebUI at the "/catalog"
endpoint. The estimated metadata size is computed in HdfsTable#getTHdfsTable():
[https://github.com/apache/impala/blob/0d49c9d6cc7fc0903d60a78d8aaa996af0249c06/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L2414-L2451]
The current formula is:
* memUsageEstimate = numPartitions * 2KB + numFiles * 500B + numBlocks * 150B
+ (optional) incrementalStats
* (optional) incrementalStats = numPartitions * numColumns * 200B
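For reference, the formula above can be written out as a minimal Java sketch. This is not the code in HdfsTable.java: the class name, method name, and constant names are hypothetical, and 2KB is assumed to mean 2048 bytes.

```java
// Illustrative sketch only; names are hypothetical, not Impala's actual identifiers.
final class MetadataSizeEstimate {
    // Per-item costs in bytes, taken from the formula in this issue.
    // Assumption: "2KB" is read as 2048 bytes.
    static final long PER_PARTITION_BYTES = 2 * 1024;
    static final long PER_FILE_BYTES = 500;
    static final long PER_BLOCK_BYTES = 150;
    static final long PER_PARTITION_COLUMN_STATS_BYTES = 200;

    private MetadataSizeEstimate() {}

    /** Returns the estimated metadata size of one table, in bytes. */
    static long estimate(long numPartitions, long numFiles, long numBlocks,
                         long numColumns, boolean hasIncrementalStats) {
        long bytes = numPartitions * PER_PARTITION_BYTES
            + numFiles * PER_FILE_BYTES
            + numBlocks * PER_BLOCK_BYTES;
        if (hasIncrementalStats) {
            // Optional term: one stats entry per (partition, column) pair.
            bytes += numPartitions * numColumns * PER_PARTITION_COLUMN_STATS_BYTES;
        }
        return bytes;
    }
}
```

Note that every input is a count. No term depends on the length of any string stored in the metadata.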
This formula is adequate for comparing tables against each other, but it can't
be used to estimate the max heap size that catalogd needs. For example, it
ignores column comments and tblproperties, which can contain long strings.
Column names should also be counted, since they add up for wide tables.
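To make the gap concrete, here is a rough sketch that totals the characters held in the strings the formula ignores. The class and method are hypothetical and not part of Impala. The result is a character count, not a heap size, since it leaves out object headers, collection overhead, and string encoding.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not Impala's actual identifiers.
final class MetadataStringChars {
    private MetadataStringChars() {}

    /**
     * Sums the lengths of column names, column comments, and tblproperties
     * keys and values. Null entries (e.g. a column without a comment) count as 0.
     */
    static long totalChars(List<String> columnNames, List<String> columnComments,
                           Map<String, String> tblProperties) {
        long chars = 0;
        for (String name : columnNames) chars += length(name);
        for (String comment : columnComments) chars += length(comment);
        for (Map.Entry<String, String> e : tblProperties.entrySet()) {
            chars += length(e.getKey()) + length(e.getValue());
        }
        return chars;
    }

    private static long length(String s) {
        return s == null ? 0 : s.length();
    }
}
```

A wide table with thousands of commented columns could accumulate a large total here while the count-based estimate stays unchanged.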
We can compare the estimated sizes with results from ehcache-sizeof or jamm and
update the formula accordingly. Alternatively, we can use these libraries to
measure the sizes directly, provided they don't hurt performance.
CC [~MikaelSmith]