Samrose-Ahmed opened a new issue, #10703:
URL: https://github.com/apache/iceberg/issues/10703

   ### Proposed Change
   
   It is not currently possible to determine the uncompressed, unencoded size
   of variable-length columns. For fixed-length data types the size can be
   derived from the null count and row count statistics, but this does not work
   for variable-length column types such as strings or binary, since the data
   is encoded and compressed. Tracking data size is useful for many purposes,
   including engine planning and query optimization (for example, planning a
   data exchange or join), as well as for readers estimating the memory needed
   to read data.
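
For fixed-width types, the estimate mentioned above is simple arithmetic. The following is a minimal sketch, assuming per-column value and null counts are available; the width table and the `fixed_width_size` helper are hypothetical names for illustration, not Iceberg APIs.

```python
# Hypothetical byte widths for a few fixed-length types (illustrative only).
FIXED_WIDTH_BYTES = {"int": 4, "long": 8, "float": 4, "double": 8}


def fixed_width_size(type_name: str, value_count: int, null_count: int) -> int:
    """Estimate the uncompressed, unencoded size of a fixed-width column.

    Assumes nulls occupy no data bytes, so only non-null values are counted.
    """
    if null_count > value_count:
        raise ValueError("null_count cannot exceed value_count")
    return (value_count - null_count) * FIXED_WIDTH_BYTES[type_name]


print(fixed_width_size("long", value_count=1_000, null_count=100))  # 7200
```

No comparable calculation works for a string or binary column, because the width of each value is unknown without reading the data.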
   
   We propose adding a new optional property, similar to `columnSizes`, inside
   the manifest files. This will be a map from field id to the uncompressed,
   unencoded size in bytes. It should only be set for variable-length type
   columns (String/Binary).
   
   
   Add the following to the `manifest_entry.data_file` struct:
   
   | v1 | v2 | Field id, name | Type | Description |
   |------------|------------|-----------------------------------------|----------------------------|-------------|
   | _optional_ | _optional_ | `142 variable_length_column_sizes` | `map<143: int, 144: long>` | Map from column id to the uncompressed unencoded size of all regions that store the column. Only valid for variable length types like string/byte array. |
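
As a rough illustration of how a reader or planner might consume the proposed field, the sketch below sums the per-column sizes for the projected columns across a set of data files. The `DataFile` class and the `estimate_read_bytes` helper are hypothetical stand-ins, not Iceberg API; only the field name `variable_length_column_sizes` comes from the proposal.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class DataFile:
    """Hypothetical stand-in for a manifest entry's data_file struct."""
    path: str
    # Proposed optional field 142: field id -> uncompressed unencoded bytes.
    variable_length_column_sizes: Optional[Dict[int, int]] = None


def estimate_read_bytes(files: Iterable[DataFile],
                        field_ids: Iterable[int]) -> Optional[int]:
    """Sum the proposed sizes for the projected variable-length columns.

    Returns None when any file lacks the statistic for a requested column,
    since the field is optional and a partial sum would underestimate.
    """
    wanted = list(field_ids)
    total = 0
    for f in files:
        sizes = f.variable_length_column_sizes
        if sizes is None:
            return None
        for fid in wanted:
            if fid not in sizes:
                return None
            total += sizes[fid]
    return total


files = [
    DataFile("a.parquet", {3: 1_024, 5: 4_096}),
    DataFile("b.parquet", {3: 2_048, 5: 512}),
]
print(estimate_read_bytes(files, [3]))     # 3072
print(estimate_read_bytes(files, [3, 5]))  # 7680
```

Returning `None` rather than a partial total is one possible design choice here: because the field is optional, a caller can then fall back to another estimate instead of trusting an undercount.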
   
   
   See also [Parquet format SizeStatistics and `unencoded_byte_array_data_bytes`](https://github.com/apache/parquet-format/commit/40699d05bd24181de6b1457babbee2c16dce3803).
   
   Relevant GitHub issues:
   * https://github.com/apache/iceberg/issues/8274
   * https://github.com/apache/iceberg/issues/9966
   
   
   ### Proposal document
   
   
https://docs.google.com/document/d/189kIZxx_dUloBCDPUz2Fh0BBOZSm2fXHHXWpdpq3DrU
   
   ### Specifications
   
   - [X] Table
   - [ ] View
   - [ ] REST
   - [ ] Puffin
   - [ ] Encryption
   - [ ] Other


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

