ZacBlanco opened a new issue, #9966: URL: https://github.com/apache/iceberg/issues/9966
### Feature Request / Improvement

I have been working in the optimizer on [Presto](https://github.com/prestodb/presto) and was auditing the statistics returned by Iceberg for column sizes. I tested our statistics implementation and found that no matter how we sum the data sizes from Iceberg manifests, they are 3-4x lower than expected. Below are some queries against a TPC-H table showing the sizes from our `SHOW STATS` statement and from `sum_data_size_for_stats`, an aggregation function used when we call `ANALYZE` on tables.

```
SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     | 167815.0  | NULL                  | 0.0            | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           | 15000.0   | NULL      | NULL
(2 rows)
```

```
presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)
```

The data size reported by the Iceberg metadata severely underestimates the real size (167815 vs. 727364, roughly 4.3x here). After reading the spec more closely, I realized that Iceberg returns the on-disk size of the data, including compression. In my opinion, this is the wrong statistic to return. Most analytics systems (to my knowledge) won't spend CPU time compressing data during execution, because it is more time-efficient to ship the data directly over the wire while the query is running. At least in Presto, the SQL query optimizer takes data sizes into account when determining the physical plan for queries, such as when data should be exchanged between nodes or which physical join implementation to use.

I tested swapping the current [parquet metric implementation](https://github.com/apache/iceberg/blob/560b72344350816eb31f9a165c2947caa7381a9b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L127) to use the parquet footers' `getTotalUncompressedSize` method and found it gives much more accurate results - at least for Presto (a minimal sketch of the footer APIs involved is appended at the end of this issue).

```
presto:tpch> SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     | 745598.0  | NULL                  | 0.0            | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           | 15000.0   | NULL      | NULL
(2 rows)

presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)
```

Changing this would probably require a change to the spec. Also, while Parquet footers record an uncompressed size, I'm not sure whether other file formats like ORC do. I am curious what the community thinks about the use cases for tracking compressed vs. uncompressed sizes.

### Query engine

None
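---

For context, here is a minimal, self-contained sketch of the two parquet-mr footer APIs in question, not the actual patch I tested in Iceberg. It sums per-column-chunk sizes from a Parquet footer in both forms; as of the linked commit, Iceberg's `ParquetUtil` appears to aggregate `ColumnChunkMetaData.getTotalSize()` (compressed), and the experiment above swapped in `getTotalUncompressedSize()`. The class name and the file-path argument are placeholders; point it at any data file from the table.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ColumnSizeComparison {
  public static void main(String[] args) throws IOException {
    // Placeholder path: pass a Parquet data file from the Iceberg table.
    Path path = new Path(args[0]);

    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      ParquetMetadata footer = reader.getFooter();

      long compressed = 0L;
      long uncompressed = 0L;
      for (BlockMetaData block : footer.getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // On-disk (post-compression) chunk size: what the column_sizes
          // metric reflects today.
          compressed += column.getTotalSize();
          // In-memory (pre-compression) size: the alternative proposed here.
          uncompressed += column.getTotalUncompressedSize();
        }
      }
      System.out.printf("compressed=%d uncompressed=%d%n", compressed, uncompressed);
    }
  }
}
```

Note that `getTotalUncompressedSize` is still the size of the *encoded* pages (dictionary/RLE encoding applied, compression codec not), so it won't exactly match an engine's in-memory accounting either - which is consistent with the small remaining gap between 745598 and 727364 above.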