Re: [PR] Docs: Deprecate data_file.distinct_counts in v3 [iceberg]

via GitHub Thu, 06 Feb 2025 05:26:21 -0800


nastra commented on code in PR #12182:
URL: https://github.com/apache/iceberg/pull/12182#discussion_r1944722545



##########
format/spec.md:
##########
@@ -587,32 +587,32 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 
 `data_file` is a struct with the following fields:
 
-| v1         | v2         | v3         | Field id, name                    | 
Type                                                                        | 
Description                                                                     
                                                                                
                                                   |
-| ---------- 
|------------|------------|-----------------------------------|-----------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-|            | _required_ | _required_ | **`134  content`**                | 
`int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | 
Type of content stored by the data file: data, equality deletes, or position 
deletes (all v1 files are data files)                                           
                                                      |
-| _required_ | _required_ | _required_ | **`100  file_path`**              | 
`string`                                                                    | 
Full URI for the file with FS scheme                                            
                                                                                
                                                   |
-| _required_ | _required_ | _required_ | **`101  file_format`**            | 
`string`                                                                    | 
String file format name, `avro`, `orc`, `parquet`, or `puffin`                  
                                                                                
                                                   |
-| _required_ | _required_ | _required_ | **`102  partition`**              | 
`struct<...>`                                                               | 
Partition data tuple, schema based on the partition spec output using partition 
field ids for the struct field ids                                              
                                                   |
-| _required_ | _required_ | _required_ | **`103  record_count`**           | 
`long`                                                                      | 
Number of records in this file, or the cardinality of a deletion vector         
                                                                                
                                                   |
-| _required_ | _required_ | _required_ | **`104  file_size_in_bytes`**     | 
`long`                                                                      | 
Total file size in bytes                                                        
                                                                                
                                                   |
-| _required_ |            |            | ~~**`105 block_size_in_bytes`**~~ | 
`long`                                                                      | 
**Deprecated. Always write a default in v1. Do not write in v2 or v3.**         
                                                                                
                                                   |
-| _optional_ |            |            | ~~**`106  file_ordinal`**~~       | 
`int`                                                                       | 
**Deprecated. Do not write.**                                                   
                                                                                
                                                   |
-| _optional_ |            |            | ~~**`107  sort_columns`**~~       | 
`list<112: int>`                                                            | 
**Deprecated. Do not write.**                                                   
                                                                                
                                                   |
-| _optional_ | _optional_ | _optional_ | **`108  column_sizes`**           | 
`map<117: int, 118: long>`                                                  | 
Map from column id to the total size on disk of all regions that store the 
column. Does not include bytes necessary to read other columns, like footers. 
Leave null for row-oriented formats (Avro)                |
-| _optional_ | _optional_ | _optional_ | **`109  value_counts`**           | 
`map<119: int, 120: long>`                                                  | 
Map from column id to number of values in the column (including null and NaN 
values)                                                                         
                                                      |
-| _optional_ | _optional_ | _optional_ | **`110  null_value_counts`**      | 
`map<121: int, 122: long>`                                                  | 
Map from column id to number of null values in the column                       
                                                                                
                                                   |
-| _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`**       | 
`map<138: int, 139: long>`                                                  | 
Map from column id to number of NaN values in the column                        
                                                                                
                                                   |
-| _optional_ | _optional_ | _optional_ | **`111  distinct_counts`**        | 
`map<123: int, 124: long>`                                                  | 
Map from column id to number of distinct values in the column; distinct counts 
must be derived using values in the file by counting or using sketches, but not 
using methods like merging existing distinct counts |
-| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**           | 
`map<126: int, 127: binary>`                                                | 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]                                     |
-| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**           | 
`map<129: int, 130: binary>`                                                | 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]                                  |
-| _optional_ | _optional_ | _optional_ | **`131  key_metadata`**           | 
`binary`                                                                    | 
Implementation-specific key metadata for encryption                             
                                                                                
                                                   |
-| _optional_ | _optional_ | _optional_ | **`132  split_offsets`**          | 
`list<133: long>`                                                           | 
Split offsets for the data file. For example, all row group offsets in a 
Parquet file. Must be sorted ascending                                          
                                                          |
-|            | _optional_ | _optional_ | **`135  equality_ids`**           | 
`list<136: int>`                                                            | 
Field ids used to determine row equality in equality delete files. Required 
when `content=2` and should be null otherwise. Fields with ids listed in this 
column must be present in the delete file                |
-| _optional_ | _optional_ | _optional_ | **`140  sort_order_id`**          | 
`int`                                                                       | 
ID representing sort order for this file [3].                                   
                                                                                
                                                   |
-|            |            | _optional_ | **`142  first_row_id`**           | 
`long`                                                                      | 
The `_row_id` for the first row in the data file. See [First Row ID 
Inheritance](#first-row-id-inheritance)                                         
                                                               |
-|            | _optional_ | _optional_ | **`143  referenced_data_file`**   | 
`string`                                                                    | 
Fully qualified location (URI with FS scheme) of a data file that all deletes 
reference [4]                                                                   
                                                     |
-|            |            | _optional_ | **`144  content_offset`**         | 
`long`                                                                      | 
The offset in the file where the content starts [5]                             
                                                                                
                                                   |
-|            |            | _optional_ | **`145  content_size_in_bytes`**  | 
`long`                                                                      | 
The length of a referenced content stored in the file; required if 
`content_offset` is present [5]                                                 
                                                                |
+| v1         | v2         | v3         | Field id, name                    | 
Type                                                                        | 
Description                                                                     
                                                                                
                                    |
+| ---------- 
|------------|------------|-----------------------------------|-----------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+|            | _required_ | _required_ | **`134  content`**                | 
`int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | 
Type of content stored by the data file: data, equality deletes, or position 
deletes (all v1 files are data files)                                           
                                       |
+| _required_ | _required_ | _required_ | **`100  file_path`**              | 
`string`                                                                    | 
Full URI for the file with FS scheme                                            
                                                                                
                                    |
+| _required_ | _required_ | _required_ | **`101  file_format`**            | 
`string`                                                                    | 
String file format name, `avro`, `orc`, `parquet`, or `puffin`                  
                                                                                
                                    |
+| _required_ | _required_ | _required_ | **`102  partition`**              | 
`struct<...>`                                                               | 
Partition data tuple, schema based on the partition spec output using partition 
field ids for the struct field ids                                              
                                    |
+| _required_ | _required_ | _required_ | **`103  record_count`**           | 
`long`                                                                      | 
Number of records in this file, or the cardinality of a deletion vector         
                                                                                
                                    |
+| _required_ | _required_ | _required_ | **`104  file_size_in_bytes`**     | 
`long`                                                                      | 
Total file size in bytes                                                        
                                                                                
                                    |
+| _required_ |            |            | ~~**`105 block_size_in_bytes`**~~ | 
`long`                                                                      | 
**Deprecated. Always write a default in v1. Do not write in v2 or v3.**         
                                                                                
                                    |
+| _optional_ |            |            | ~~**`106  file_ordinal`**~~       | 
`int`                                                                       | 
**Deprecated. Do not write.**                                                   
                                                                                
                                    |
+| _optional_ |            |            | ~~**`107  sort_columns`**~~       | 
`list<112: int>`                                                            | 
**Deprecated. Do not write.**                                                   
                                                                                
                                    |
+| _optional_ | _optional_ | _optional_ | **`108  column_sizes`**           | 
`map<117: int, 118: long>`                                                  | 
Map from column id to the total size on disk of all regions that store the 
column. Does not include bytes necessary to read other columns, like footers. 
Leave null for row-oriented formats (Avro) |
+| _optional_ | _optional_ | _optional_ | **`109  value_counts`**           | 
`map<119: int, 120: long>`                                                  | 
Map from column id to number of values in the column (including null and NaN 
values)                                                                         
                                       |
+| _optional_ | _optional_ | _optional_ | **`110  null_value_counts`**      | 
`map<121: int, 122: long>`                                                  | 
Map from column id to number of null values in the column                       
                                                                                
                                    |
+| _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`**       | 
`map<138: int, 139: long>`                                                  | 
Map from column id to number of NaN values in the column                        
                                                                                
                                    |
+| _optional_ | _optional_ |            | ~~**`111  distinct_counts`**~~    | 
`map<123: int, 124: long>`                                                  | 
**Deprecated. Do not write.**                                                   
                                                                                
                                    |

Review Comment:
   Can we make this change without having to update the entire table?
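
   To illustrate the concern: in the hunk above every row shows up as changed because the column padding was rewritten, although the only row whose content differs is `111  distinct_counts`. The following is a minimal Python sketch of how one might compare table rows while ignoring padding. The helper names `cells` and `changed_rows` are hypothetical, the sample rows are abbreviated copies of two rows from the hunk, and none of this is part of the PR or of Iceberg tooling.

```python
def cells(row: str) -> list[str]:
    """Split a markdown table row into cells, dropping the padding around each cell."""
    inner = row.strip().removeprefix("|").removesuffix("|")
    return [cell.strip() for cell in inner.split("|")]


def changed_rows(old_rows: list[str], new_rows: list[str]) -> list[tuple[list[str], list[str]]]:
    """Return (old, new) cell lists for rows whose content differs once padding is ignored."""
    pairs = zip(map(cells, old_rows), map(cells, new_rows))
    return [(old, new) for old, new in pairs if old != new]


# Abbreviated sample rows (assumption: descriptions shortened for readability).
old_rows = [
    "| _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column      |",
    "| _optional_ | _optional_ | _optional_ | **`111  distinct_counts`**  | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column |",
]
new_rows = [
    "| _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`**    | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |",
    "| _optional_ | _optional_ |            | ~~**`111  distinct_counts`**~~ | `map<123: int, 124: long>` | **Deprecated. Do not write.**                            |",
]

# Only rows with a real content change are reported; re-padded rows are skipped.
for old, new in changed_rows(old_rows, new_rows):
    print(old[3], "->", new[3])
```

   With padding ignored, only the `distinct_counts` row is reported, which suggests the deprecation could be expressed as a single-row change if the existing column widths are left alone.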



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

