Samrose-Ahmed opened a new issue, #10703: URL: https://github.com/apache/iceberg/issues/10703
### Proposed Change

It is not currently possible to determine the uncompressed, unencoded size of variable-length columns. For fixed-length data types the size can be derived from the null count and row count statistics, but this does not work for variable-length column types such as string or binary, because the data is encoded and compressed.

Tracking data size is useful for many purposes, including engine planning and query optimization (for example, planning a data exchange or a join), and letting readers estimate the memory needed to read data.

We propose adding a new optional property, similar to `columnSizes`, inside the manifest files. It is a map from field id to the uncompressed, unencoded size in bytes, and it should only be set for variable-length columns (String/Binary).

Add the following to the `manifest_entry.data_file` struct:

| v1 | v2 | Field id, name | Type | Description |
|------------|------------|------------------------------------|----------------------------|-------------|
| _optional_ | _optional_ | `142 variable_length_column_sizes` | `map<143: int, 144: long>` | Map from column id to the uncompressed unencoded size of all regions that store the column. Only valid for variable length types like string/byte array. |

See also [Parquet format SizeStatistics and `unencoded_byte_array_data_bytes`](https://github.com/apache/parquet-format/commit/40699d05bd24181de6b1457babbee2c16dce3803).

Relevant GitHub issues:

* https://github.com/apache/iceberg/issues/8274
* https://github.com/apache/iceberg/issues/9966

### Proposal document

https://docs.google.com/document/d/189kIZxx_dUloBCDPUz2Fh0BBOZSm2fXHHXWpdpq3DrU

### Specifications

- [X] Table
- [ ] View
- [ ] REST
- [ ] Puffin
- [ ] Encryption
- [ ] Other
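To make the intended use concrete, here is a minimal Python sketch of how a planner might estimate a column's uncompressed size from per-file statistics. It is an illustration only, not Iceberg code: `DataFileStats`, `estimate_uncompressed_bytes`, and the `FIXED_WIDTH_BYTES` table are hypothetical names, and the fixed-width branch assumes a top-level (non-nested) column so that the row count equals the value count.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Unencoded widths in bytes for a few fixed-length types (illustrative subset).
FIXED_WIDTH_BYTES = {"int": 4, "long": 8, "float": 4, "double": 8}


@dataclass
class DataFileStats:
    """Hypothetical stand-in for the statistics carried by a manifest entry."""
    record_count: int
    null_value_counts: Dict[int, int]
    # Proposed field 142: field id -> uncompressed, unencoded size in bytes.
    variable_length_column_sizes: Dict[int, int] = field(default_factory=dict)


def estimate_uncompressed_bytes(
    stats: DataFileStats, field_id: int, type_name: str
) -> Optional[int]:
    """Return the estimated uncompressed size of one column, or None if unknown."""
    width = FIXED_WIDTH_BYTES.get(type_name)
    if width is not None:
        # Fixed-length: size follows from the row count and null count alone.
        non_null = stats.record_count - stats.null_value_counts.get(field_id, 0)
        return non_null * width
    # Variable-length: only knowable if the writer recorded the proposed statistic.
    return stats.variable_length_column_sizes.get(field_id)


stats = DataFileStats(
    record_count=1000,
    null_value_counts={1: 100},
    variable_length_column_sizes={2: 52_340},
)
print(estimate_uncompressed_bytes(stats, 1, "long"))    # (1000 - 100) * 8 = 7200
print(estimate_uncompressed_bytes(stats, 2, "string"))  # taken from the proposed map
print(estimate_uncompressed_bytes(stats, 3, "string"))  # None: no statistic recorded
```

Because the property is optional, a reader has to treat a missing entry as "unknown" rather than zero, which is why the sketch returns `None` in that case.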
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org