Quanlong Huang created IMPALA-12962:
---------------------------------------

             Summary: Estimated metadata size of a table doesn't match the 
actual Java object size
                 Key: IMPALA-12962
                 URL: https://issues.apache.org/jira/browse/IMPALA-12962
             Project: IMPALA
          Issue Type: Bug
          Components: Catalog
            Reporter: Quanlong Huang


Catalogd shows the top-25 largest tables in its WebUI at the "/catalog" 
endpoint. The estimated metadata size is computed in HdfsTable#getTHdfsTable():
[https://github.com/apache/impala/blob/0d49c9d6cc7fc0903d60a78d8aaa996af0249c06/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L2414-L2451]
The current formula is:
 * memUsageEstimate = numPartitions * 2KB + numFiles * 500B + numBlocks * 150B
   + (optional) incrementalStats
 * (optional) incrementalStats = numPartitions * numColumns * 200B
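
As a rough illustration, the formula above can be written out as follows. This
is only a sketch, not the code in HdfsTable: the class, method, and constant
names are hypothetical, and 2KB is assumed to mean 2048 bytes.

```java
/** Sketch of the estimate described above. All names here are hypothetical. */
final class MetadataSizeEstimate {
  // Per-object costs quoted in this issue. 2KB is taken as 2048 bytes (assumption).
  static final long PER_PARTITION_BYTES = 2048;
  static final long PER_FILE_BYTES = 500;
  static final long PER_BLOCK_BYTES = 150;
  static final long PER_COLUMN_STATS_BYTES = 200;

  static long estimate(long numPartitions, long numFiles, long numBlocks,
      long numColumns, boolean hasIncrementalStats) {
    long bytes = numPartitions * PER_PARTITION_BYTES
        + numFiles * PER_FILE_BYTES
        + numBlocks * PER_BLOCK_BYTES;
    if (hasIncrementalStats) {
      // The optional term grows with partitions times columns.
      bytes += numPartitions * numColumns * PER_COLUMN_STATS_BYTES;
    }
    return bytes;
  }

  public static void main(String[] args) {
    // 1000 partitions, 10000 files, 20000 blocks, 50 columns, with incremental stats.
    System.out.println(estimate(1000, 10000, 20000, 50, true)); // prints 20048000
  }
}
```

Note that for a table with incremental stats, the partitions-times-columns term
can be about as large as the rest of the estimate combined, as in the example
values used in main().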

This formula is fine for comparing tables against each other, but it can't be 
used to estimate the max heap size of catalogd. E.g. it ignores column comments 
and tblproperties, which can contain long strings. Column names should also be 
counted, since they add up when the table is wide.
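
To get a feel for how much string data the formula leaves out, one could count
the characters in those fields. The helper below is a hypothetical sketch, not
existing Impala code, and it returns a character count as a rough proxy, not
an actual heap size.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts characters in strings the formula ignores. */
final class IgnoredStringChars {
  /** Total characters in column names, column comments, and tblproperties. */
  static long count(List<String> columnNames, List<String> columnComments,
      Map<String, String> tblProperties) {
    long chars = 0;
    for (String name : columnNames) {
      chars += name.length();
    }
    for (String comment : columnComments) {
      // Columns without a comment are represented as null here (assumption).
      if (comment != null) chars += comment.length();
    }
    for (Map.Entry<String, String> e : tblProperties.entrySet()) {
      chars += e.getKey().length() + e.getValue().length();
    }
    return chars;
  }
}
```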

We can compare the estimated sizes with results from ehcache-sizeof or jamm and 
update the formula. Alternatively, we could use these libraries to estimate the 
sizes directly, provided they don't hurt performance.
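
The comparison could be sketched as below. The measurer is passed in as a
function so that either library can be plugged in. EstimateCheck and ratio()
are hypothetical names, and how the measurer is wired to jamm or ehcache-sizeof
is left as an assumption to verify against the library version in use.

```java
import java.util.function.ToLongFunction;

/** Hypothetical helper for comparing a formula estimate with a measured size. */
final class EstimateCheck {
  /**
   * Returns measured deep size divided by the formula's estimate.
   * A value above 1 means the formula underestimates the object.
   */
  static double ratio(Object table, long estimatedBytes,
      ToLongFunction<Object> deepSizeOf) {
    if (estimatedBytes <= 0) {
      throw new IllegalArgumentException("estimate must be positive");
    }
    return (double) deepSizeOf.applyAsLong(table) / estimatedBytes;
  }
}
```

The deepSizeOf argument could wrap, for example, jamm's
MemoryMeter#measureDeep or ehcache-sizeof's SizeOf#deepSizeOf. Logging this
ratio for the largest tables would show which terms of the formula need
adjusting.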

CC [~MikaelSmith] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
