ZacBlanco opened a new issue, #9966:
URL: https://github.com/apache/iceberg/issues/9966

   ### Feature Request / Improvement
   
   I have been working on the optimizer in [Presto](https://github.com/prestodb/presto) and was auditing the statistics returned by Iceberg for column sizes. I tested our statistics implementation and found that no matter how we sum the data sizes from Iceberg manifests, they are 3-4x lower than expected. Below are queries against a TPC-H table showing the sizes reported by our `SHOW STATS` statement and by `sum_data_size_for_stats`, an aggregation function used when we run `ANALYZE` on tables.
   
   ```
   SHOW STATS FOR (select comment from orders);
    column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
   -------------+-----------+-----------------------+----------------+-----------+-----------+------------
    comment     |  167815.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
    NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
   (2 rows)
   ```
   
   ```
   presto:tpch> select sum_data_size_for_stats(comment) from orders;
    _col0
   --------
    727364
   (1 row)
   ```
   
   The data size reported by the Iceberg metadata is a severe underestimate. After reading the spec more closely, I realized that Iceberg returns the on-disk size of the data, i.e. the size after compression. In my opinion, this is the wrong statistic to return.
   
   To my knowledge, most analytics systems won't spend CPU time compressing data during query execution, because it is more time-efficient to ship the data directly over the wire. In Presto at least, the SQL query optimizer takes data sizes into account when determining the physical plan, for example when deciding whether data should be exchanged between nodes or which physical join implementation to use, so the in-memory (uncompressed) size is the quantity that matters.
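   
   As a hypothetical illustration of the stakes (this is not Presto's actual optimizer code, and the threshold is made up), an underestimated data size can silently flip a decision like broadcast vs. partitioned join:
   
   ```java
   public class JoinStrategyExample {
     // Hypothetical broadcast threshold; real engines make this configurable.
     static final long BROADCAST_THRESHOLD_BYTES = 500_000;
   
     static String chooseJoin(double buildSideDataSizeBytes) {
       // Broadcast the build side only if it is estimated to be small enough.
       return buildSideDataSizeBytes <= BROADCAST_THRESHOLD_BYTES
           ? "BROADCAST"
           : "PARTITIONED";
     }
   
     public static void main(String[] args) {
       System.out.println(chooseJoin(167_815.0)); // compressed estimate   -> BROADCAST
       System.out.println(chooseJoin(745_598.0)); // uncompressed estimate -> PARTITIONED
     }
   }
   ```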
   
   I tested swapping the current [parquet metric implementation](https://github.com/apache/iceberg/blob/560b72344350816eb31f9a165c2947caa7381a9b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L127) to use the Parquet footer's `getTotalUncompressedSize` method and found that it gives much more accurate results, at least for Presto.
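   
   A minimal sketch of what that experiment looks like (hedged; the real `ParquetUtil.footerMetrics` does much more, including null counts and bounds, and keys by Iceberg field ID rather than column path):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import org.apache.parquet.hadoop.metadata.BlockMetaData;
   import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
   import org.apache.parquet.hadoop.metadata.ParquetMetadata;
   
   public class UncompressedColumnSizes {
     // Sums per-column sizes across all row groups in a Parquet footer,
     // keyed by dotted column path.
     static Map<String, Long> columnSizes(ParquetMetadata footer) {
       Map<String, Long> sizes = new HashMap<>();
       for (BlockMetaData block : footer.getBlocks()) {
         for (ColumnChunkMetaData column : block.getColumns()) {
           sizes.merge(
               column.getPath().toDotString(),
               // getTotalSize() is the compressed, on-disk size that Iceberg
               // records today; getTotalUncompressedSize() reflects the
               // in-memory / over-the-wire size instead.
               column.getTotalUncompressedSize(),
               Long::sum);
         }
       }
       return sizes;
     }
   }
   ```
   
   The resulting `SHOW STATS` output lines up closely with `sum_data_size_for_stats`: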
   
   ```
   presto:tpch> SHOW STATS FOR (select comment from orders);
    column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
   -------------+-----------+-----------------------+----------------+-----------+-----------+------------
    comment     |  745598.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
    NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
   (2 rows)
   
   ------
   
   presto:tpch> select sum_data_size_for_stats(comment) from orders;
    _col0
   --------
    727364
   (1 row)
   ```
   
   
   Changing this would probably require a change in the spec. Also, while Parquet has footer support for uncompressed size, I'm not sure whether other file formats like ORC do. I am curious what the community thinks about the use cases for tracking compressed vs. uncompressed sizes.
   
   
   ### Query engine
   
   None

