Re: [PR] Spec: add multi-arg transform support [iceberg]

via GitHub Mon, 15 Jan 2024 22:44:15 -0800


szehon-ho commented on code in PR #8579:
URL: https://github.com/apache/iceberg/pull/8579#discussion_r1452981468



##########
format/spec.md:
##########
@@ -1119,21 +1156,30 @@ Partition specs are serialized as a JSON object with 
the following fields:
 
 Each partition field in the fields list is stored as an object. See the table 
for more detail:
 
-|Transform or Field|JSON representation|Example|
-|--- |--- |--- |
-|**`identity`**|`JSON string: "identity"`|`"identity"`|
-|**`bucket[N]`**|`JSON string: "bucket[<N>]"`|`"bucket[16]"`|
-|**`truncate[W]`**|`JSON string: "truncate[<W>]"`|`"truncate[20]"`|
-|**`year`**|`JSON string: "year"`|`"year"`|
-|**`month`**|`JSON string: "month"`|`"month"`|
-|**`day`**|`JSON string: "day"`|`"day"`|
-|**`hour`**|`JSON string: "hour"`|`"hour"`|
-|**`Partition Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": <id 
int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": 
<name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br 
/>`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br />&nbsp;&nbsp;`"field-id": 
1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br />&nbsp;&nbsp;`"transform": 
"bucket[16]"`<br />`}`|
+| Transform or Field                                 | JSON representation     
                                                                                
                                                                                
                                                                | Example       
                                                                                
                                                                                
                                            |
+|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **`identity`**                                     | `JSON string: 
"identity"`                                                                     
                                                                                
                                                                          | 
`"identity"`                                                                    
                                                                                
                                                          |
+| **`bucket[N]`**                                    | `JSON string: 
"bucket[<N>]"`                                                                  
                                                                                
                                                                          | 
`"bucket[16]"`                                                                  
                                                                                
                                                          |
+| **`bucket[N]`** (multi-arg bucket [1])             | `JSON string: 
"bucketV2[<N>]"`                                                                
                                                                                
                                                                          | 
`"bucketV2[16]"`                                                                
                                                                                
                                                          |

Review Comment:
   should the name be bucketv2 then?



##########
format/spec.md:
##########
@@ -1119,21 +1156,30 @@ Partition specs are serialized as a JSON object with 
the following fields:
 
 Each partition field in the fields list is stored as an object. See the table 
for more detail:
 
-|Transform or Field|JSON representation|Example|
-|--- |--- |--- |
-|**`identity`**|`JSON string: "identity"`|`"identity"`|
-|**`bucket[N]`**|`JSON string: "bucket[<N>]"`|`"bucket[16]"`|
-|**`truncate[W]`**|`JSON string: "truncate[<W>]"`|`"truncate[20]"`|
-|**`year`**|`JSON string: "year"`|`"year"`|
-|**`month`**|`JSON string: "month"`|`"month"`|
-|**`day`**|`JSON string: "day"`|`"day"`|
-|**`hour`**|`JSON string: "hour"`|`"hour"`|
-|**`Partition Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": <id 
int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": 
<name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br 
/>`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br />&nbsp;&nbsp;`"field-id": 
1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br />&nbsp;&nbsp;`"transform": 
"bucket[16]"`<br />`}`|
+| Transform or Field                                 | JSON representation     
                                                                                
                                                                                
                                                                | Example       
                                                                                
                                                                                
                                            |
+|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **`identity`**                                     | `JSON string: 
"identity"`                                                                     
                                                                                
                                                                          | 
`"identity"`                                                                    
                                                                                
                                                          |
+| **`bucket[N]`**                                    | `JSON string: 
"bucket[<N>]"`                                                                  
                                                                                
                                                                          | 
`"bucket[16]"`                                                                  
                                                                                
                                                          |
+| **`bucket[N]`** (multi-arg bucket [1])             | `JSON string: 
"bucketV2[<N>]"`                                                                
                                                                                
                                                                          | 
`"bucketV2[16]"`                                                                
                                                                                
                                                          |
+| **`truncate[W]`**                                  | `JSON string: 
"truncate[<W>]"`                                                                
                                                                                
                                                                          | 
`"truncate[20]"`                                                                
                                                                                
                                                          |
+| **`year`**                                         | `JSON string: "year"`   
                                                                                
                                                                                
                                                                | `"year"`      
                                                                                
                                                                                
                                            |
+| **`month`**                                        | `JSON string: "month"`  
                                                                                
                                                                                
                                                                | `"month"`     
                                                                                
                                                                                
                                            |
+| **`day`**                                          | `JSON string: "day"`    
                                                                                
                                                                                
                                                                | `"day"`       
                                                                                
                                                                                
                                            |
+| **`hour`**                                         | `JSON string: "hour"`   
                                                                                
                                                                                
                                                                | `"hour"`      
                                                                                
                                                                                
                                            |
+| **`Partition Field`**                              | `JSON object: {`<br 
/>&nbsp;&nbsp;`"source-id": <id int>,`<br />&nbsp;&nbsp;`"field-id": <field id 
int>,`<br />&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": 
<transform JSON>`<br />`}`                                           | `{`<br 
/>&nbsp;&nbsp;`"source-id": 1,`<br />&nbsp;&nbsp;`"field-id": 1000,`<br 
/>&nbsp;&nbsp;`"name": "id_bucket",`<br />&nbsp;&nbsp;`"transform": 
"bucket[16]"`<br />`}`                                                 |
+| **`Partition Field with multi-arg transform`** [2] | `JSON object: {`<br 
/>&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": <list of 
ids>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": 
<name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br />`}` | 
`{`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": 
[1,2],`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": 
"id_type_bucket",`<br />&nbsp;&nbsp;`"transform": "bucketV2[16]"`<br />`}` |
 
 In some cases partition specs are stored using only the field list instead of 
the object format that includes the spec ID, like the deprecated 
`partition-spec` field in table metadata. The object format should be used 
unless otherwise noted in this spec.
 
 The `field-id` property was added for each partition field in v2. In v1, the 
reference implementation assigned field ids sequentially in each spec starting 
at 1,000. See Partition Evolution for more details.
 
+Notes:
+
+1. For multi-arg bucket, the serialized form is `bucketV2[N]` instead of 
`bucket[N]` to distinguish it from the single-arg bucket transform. Therefore, 
old readers/writers will identify this transform as an unknown transform, old 
writer will stop writing the table if it encounters this transform, but old 
readers would still be able to read the table by scanning all the partitions.

Review Comment:
   I actually think we can remove this note.  I feel there is nothing here that 
is specific to bucketv2 transform, but rather the behavior of existing 
readers/writers to ignore unknown transforms.



##########
format/spec.md:
##########
@@ -314,7 +314,7 @@ Partition field IDs must be reused if an existing partition 
spec contains an equ
 | Transform name    | Description                                              
    | Source types                                                              
                                | Result type |
 
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
 | **`identity`**    | Source value, unmodified                                 
    | Any                                                                       
                                | Source type |
-| **`bucket[N]`**   | Hash of value, mod `N` (see below)                       
    | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, 
`timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int`   
    |
+| **`bucket[N]`**   | Hash of value(s), mod `N` (see below)                    
    | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, 
`timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int`   
    |

Review Comment:
   I am ok with adding bucketv2 here.  Let see what @aokolnychyi @rdblue think 



##########
format/spec.md:
##########
@@ -1149,6 +1195,12 @@ Each sort field in the fields list is stored as an 
object with the following pro
 |--- |--- |--- |
 |**`Sort Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform 
JSON>,`<br />&nbsp;&nbsp;`"source-id": <source id int>,`<br 
/>&nbsp;&nbsp;`"direction": <direction string>,`<br 
/>&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}`|`{`<br 
/>&nbsp;&nbsp;`  "transform": "bucket[4]",`<br />&nbsp;&nbsp;`  "source-id": 
3,`<br />&nbsp;&nbsp;`  "direction": "desc",`<br />&nbsp;&nbsp;`  "null-order": 
"nulls-last"`<br />`}`|
 
+Similar with partition fields, sort fields could also contain multi source-ids 
for sorting:
+
+| Field                                 | JSON representation                  
                                                                                
                                                                                
                                                                    | Example   
                                                                                
                                                                                
                                                             |
+|---------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **`Sort Field(multi-arg transform)`** | `JSON object: {`<br 
/>&nbsp;&nbsp;`"transform": <transform JSON>,`<br />&nbsp;&nbsp;`"source-id": 
-1,`<br />&nbsp;&nbsp;`"source-ids": <list of ids>,`<br 
/>&nbsp;&nbsp;`"direction": <direction string>,`<br 
/>&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}` | `{`<br 
/>&nbsp;&nbsp;`  "transform": "bucketV2[4]",`<br />&nbsp;&nbsp;`  "source-id": 
-1,`<br />&nbsp;&nbsp;`  "source-id": [1,2],`<br />&nbsp;&nbsp;`  "direction": 
"desc",`<br />&nbsp;&nbsp;`  "null-order": "nulls-last"`<br />`}` |

Review Comment:
   should we add Notes section and also add the note from partition field here 
(about when to emit and omit `source-id` and `source-ids`?



##########
format/spec.md:
##########
@@ -1149,6 +1195,12 @@ Each sort field in the fields list is stored as an 
object with the following pro
 |--- |--- |--- |
 |**`Sort Field`**|`JSON object: {`<br />&nbsp;&nbsp;`"transform": <transform 
JSON>,`<br />&nbsp;&nbsp;`"source-id": <source id int>,`<br 
/>&nbsp;&nbsp;`"direction": <direction string>,`<br 
/>&nbsp;&nbsp;`"null-order": <null-order string>`<br />`}`|`{`<br 
/>&nbsp;&nbsp;`  "transform": "bucket[4]",`<br />&nbsp;&nbsp;`  "source-id": 
3,`<br />&nbsp;&nbsp;`  "direction": "desc",`<br />&nbsp;&nbsp;`  "null-order": 
"nulls-last"`<br />`}`|
 
+Similar with partition fields, sort fields could also contain multi source-ids 
for sorting:

Review Comment:
   Going with the more formal nature of the spec, I would suggest to not 
abbreviate here: 
   
   Maybe something like: 'sort order fields may also refer to multiple source 
fields'



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Spec: add multi-arg transform support [iceberg]

Reply via email to