szehon-ho commented on code in PR #8579: URL: https://github.com/apache/iceberg/pull/8579#discussion_r1445563921
########## format/spec.md:
@@ -322,14 +322,22 @@ The `void` transform may be used to replace the transform in an existing partiti

 #### Bucket Transform Details

-Bucket partition transforms use a 32-bit hash of the source value. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
+Bucket partition transforms use a 32-bit hash of the source value or source value list. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.

 Transforms are parameterized by a number of buckets [1], `N`. The hash mod `N` must produce a positive value by first discarding the sign bit of the hash value. In pseudo-code, the function is:

 ```
   def bucket_N(x) = (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N
 ```

+When bucket transforming a list of values(a.k.a. multi-arg bucket), the input is treated as a struct. The struct fields are hashed and the hashes are combined using the same hash function. In pseudo-code, the hash function is:
+
+```
+  def murmur3_x86_hash(struct(x1, x2, ..., xn)) = hasher.put(x1).put(x2)...put(xn).hash().asInt
+```

Review Comment:
   Are we missing 32_hash?

   Also, my issue is I am not sure how non-Java implementations can use this to implement the bucket transform. Is it possible to specify the pseudo-code in a more common way, ex just using the murmur3_x86_32_hash function?

########## format/spec.md:
@@ -322,14 +322,22 @@ The `void` transform may be used to replace the transform in an existing partiti

 #### Bucket Transform Details

-Bucket partition transforms use a 32-bit hash of the source value. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
+Bucket partition transforms use a 32-bit hash of the source value or source value list. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.

Review Comment:
   What do you think about `source value(s)`? Could also apply above too.
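To make the pseudo-code under discussion concrete in a language-neutral way, here is a minimal Python sketch. It assumes the byte layouts from the spec's hash-result table (`int`/`long` as 8-byte little-endian two's complement, `string` as UTF-8) and reads the multi-arg combination as `hashBytes` over the concatenated argument bytes, per the `+For multi-arg hash function...` line quoted below; `to_bytes` and `bucket_N` are illustrative helpers, not spec names.

```python
import struct


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Standard 32-bit Murmur3, x86 variant, returning a signed int32."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    length = len(data)
    rounded = length - length % 4
    for i in range(0, rounded, 4):  # 4-byte little-endian blocks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[rounded:]
    if tail:  # remaining 1-3 bytes, zero-padded little-endian
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= length
    h ^= h >> 16  # finalization mix (fmix32)
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h - (1 << 32) if h & 0x80000000 else h


def to_bytes(v) -> bytes:
    # Byte layouts from the spec's hash-result table: int/long as 8-byte
    # little-endian two's complement, string as UTF-8 (illustrative subset).
    if isinstance(v, int):
        return struct.pack("<q", v)
    return v.encode("utf-8")


def bucket_N(n: int, *args) -> int:
    # Multi-arg reading: hashBytes over the concatenation of the argument
    # bytes in order, skipping nulls; a single argument takes the same path.
    data = b"".join(to_bytes(a) for a in args if a is not None)
    return (murmur3_x86_32(data) & 0x7FFFFFFF) % n
```

With the spec's published test vectors, `murmur3_x86_32(struct.pack("<q", 34))` is `2017239379`, so `bucket_N(16, 34)` is `3`; a one-field "struct" hashes identically to the bare value, matching the claim in the diff.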
########## format/spec.md:
@@ -986,6 +994,14 @@ The types below are not currently valid for bucketing, and so are not hashed. Ho

 | **`float`** | `hashLong(doubleToLongBits(double(v))` [4]| `1.0F` → `-142385009`, `0.0F` → `1669671676`, `-0.0F` → `1669671676` |
 | **`double`** | `hashLong(doubleToLongBits(v))` [4]| `1.0D` → `-142385009`, `0.0D` → `1669671676`, `-0.0D` → `1669671676` |

+For multi-arg hash function, the hash value is the result of `hashBytes` on the concatenation of the bytes of the arguments in order, null input values are ignored.

Review Comment:
   Again, this seems too specific to Java implementation.

########## format/spec.md:
@@ -1043,21 +1059,29 @@ Partition specs are serialized as a JSON object with the following fields:

 Each partition field in the fields list is stored as an object. See the table for more detail:

-|Transform or Field|JSON representation|Example|
-|--- |--- |--- |
-|**`identity`**|`JSON string: "identity"`|`"identity"`|
-|**`bucket[N]`**|`JSON string: "bucket[<N>]"`|`"bucket[16]"`|
-|**`truncate[W]`**|`JSON string: "truncate[<W>]"`|`"truncate[20]"`|
-|**`year`**|`JSON string: "year"`|`"year"`|
-|**`month`**|`JSON string: "month"`|`"month"`|
-|**`day`**|`JSON string: "day"`|`"day"`|
-|**`hour`**|`JSON string: "hour"`|`"hour"`|
-|**`Partition Field`**|`JSON object: {`<br /> `"source-id": <id int>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}`|`{`<br /> `"source-id": 1,`<br /> `"field-id": 1000,`<br /> `"name": "id_bucket",`<br /> `"transform": "bucket[16]"`<br />`}`|
+| Transform or Field | JSON representation | Example |
+|---|---|---|
+| **`identity`** | `JSON string: "identity"` | `"identity"` |
+| **`bucket[N]`** | `JSON string: "bucket[<N>]"` | `"bucket[16]"` |
+| **`bucket[N]`** (multi-arg bucket [1]) | `JSON string: "bucketV2[<N>]"` | `"bucketV2[16]"` |
+| **`truncate[W]`** | `JSON string: "truncate[<W>]"` | `"truncate[20]"` |
+| **`year`** | `JSON string: "year"` | `"year"` |
+| **`month`** | `JSON string: "month"` | `"month"` |
+| **`day`** | `JSON string: "day"` | `"day"` |
+| **`hour`** | `JSON string: "hour"` | `"hour"` |
+| **`Partition Field`** | `JSON object: {`<br /> `"source-id": <id int>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}` | `{`<br /> `"source-id": 1,`<br /> `"field-id": 1000,`<br /> `"name": "id_bucket",`<br /> `"transform": "bucket[16]"`<br />`}` |
+| **`Partition Field with multi-arg transform`** [2] | `JSON object: {`<br /> `"source-id": -1,`<br /> `"source-ids": <list of ids>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}` | `{`<br /> `"source-id": -1,`<br /> `"source-ids": [1,2],`<br /> `"field-id": 1000,`<br /> `"name": "id_type_bucket",`<br /> `"transform": "bucketV2[16]"`<br />`}` |

 In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata.
 The object format should be used unless otherwise noted in this spec. The `field-id` property was added for each partition field in v2. In v1, the reference implementation assigned field ids sequentially in each spec starting at 1,000. See Partition Evolution for more details.

+Notes:
+
+1. For multi-arg bucket, the serialized form is `bucketV2[N]` instead of `bucket[N]` to distinguish it from the single-arg bucket transform. Therefore, old readers/writers will identify this transform as an unknown transform: old writers will stop writing the table if they encounter this transform, but old readers would still be able to read the table by scanning all the partitions. This makes adding a multi-arg transform a forward-compatible change, but not a backward-compatible change.
+2. For partition fields with a multi-arg transform, `source-id` is replaced by `source-ids` and marked as `-1` to be consistent with single-arg transforms. `source-id` should still be emitted for single-arg transforms.

Review Comment:
   Do we allow both to be specified or only one at a time?

########## format/spec.md:
@@ -322,14 +322,22 @@ The `void` transform may be used to replace the transform in an existing partiti

 #### Bucket Transform Details

-Bucket partition transforms use a 32-bit hash of the source value. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
+Bucket partition transforms use a 32-bit hash of the source value or source value list. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.

 Transforms are parameterized by a number of buckets [1], `N`. The hash mod `N` must produce a positive value by first discarding the sign bit of the hash value. In pseudo-code, the function is:

 ```
   def bucket_N(x) = (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N
 ```

+When bucket transforming a list of values(a.k.a. multi-arg bucket), the input is treated as a struct.
The struct fields are hashed and the hashes are combined using the same hash function. In pseudo-code, the hash function is:

Review Comment:
   Nits:
   1. "When bucket transform is applied on a list of values" (seems a bit more natural than "bucket transforming").
   2. Space needed before the first parenthesis.

########## format/spec.md:
@@ -376,8 +384,8 @@ Users can sort their data within partitions by columns to gain performance. The

 A sort order is defined by a sort order id and a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field consists of:

-* A **source column id** from the table's schema
-* A **transform** that is used to produce values to be sorted on from the source column. This is the same transform as described in [partition transforms](#partition-transforms).
+* A **source column id** or a list of **source column ids** from the table's schema
+* A **transform** that is used to produce values to be sorted on from the source column or columns. This is the same transform as described in [partition transforms](#partition-transforms).

Review Comment:
   Optional, can also put column(s) here.

########## format/spec.md:
@@ -1073,6 +1097,12 @@ Each sort field in the fields list is stored as an object with the following pro

 |--- |--- |--- |
 |**`Sort Field`**|`JSON object: {`<br /> `"transform": <transform JSON>,`<br /> `"source-id": <source id int>,`<br /> `"direction": <direction string>,`<br /> `"null-order": <null-order string>`<br />`}`|`{`<br /> ` "transform": "bucket[4]",`<br /> ` "source-id": 3,`<br /> ` "direction": "desc",`<br /> ` "null-order": "nulls-last"`<br />`}`|

Review Comment:
   Should we just change this to include source-ids?
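The reviewer's question about `source-id` versus `source-ids` can be made concrete. Below is a minimal, hypothetical Python sketch of how a reader might resolve source columns under the proposed serialization, assuming the reading of note 2 in which exactly one of the two carries real data (`source-id` degraded to a `-1` sentinel for multi-arg transforms); `resolve_source_ids` is an illustrative helper, not part of any Iceberg implementation.

```python
import json


def resolve_source_ids(field: dict) -> list:
    """Resolve the source column id(s) of a partition or sort field.

    Proposed scheme (one reading): multi-arg transforms carry `source-ids`
    with `source-id` set to the -1 sentinel; single-arg transforms carry
    only `source-id`.
    """
    if "source-ids" in field:
        # Multi-arg: `source-id` is expected to be the -1 sentinel.
        if field.get("source-id", -1) != -1:
            raise ValueError("source-id must be -1 when source-ids is present")
        return list(field["source-ids"])
    return [field["source-id"]]


# Example from the quoted partition-field table: a multi-arg field...
multi_arg = json.loads(
    '{"source-id": -1, "source-ids": [1, 2], "field-id": 1000,'
    ' "name": "id_type_bucket", "transform": "bucketV2[16]"}'
)

# ...and the quoted single-arg sort field.
single_arg = json.loads(
    '{"transform": "bucket[4]", "source-id": 3,'
    ' "direction": "desc", "null-order": "nulls-last"}'
)
```

Here `resolve_source_ids(multi_arg)` yields `[1, 2]` and `resolve_source_ids(single_arg)` yields `[3]`; the same helper would work unchanged if the sort-field representation adopted `source-ids`.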
########## format/spec.md:
@@ -322,14 +322,22 @@ The `void` transform may be used to replace the transform in an existing partiti

 #### Bucket Transform Details

-Bucket partition transforms use a 32-bit hash of the source value. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
+Bucket partition transforms use a 32-bit hash of the source value or source value list. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.

 Transforms are parameterized by a number of buckets [1], `N`. The hash mod `N` must produce a positive value by first discarding the sign bit of the hash value. In pseudo-code, the function is:

 ```
   def bucket_N(x) = (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N
 ```

+When bucket transforming a list of values(a.k.a. multi-arg bucket), the input is treated as a struct. The struct fields are hashed and the hashes are combined using the same hash function. In pseudo-code, the hash function is:
+
+```
+  def murmur3_x86_hash(struct(x1, x2, ..., xn)) = hasher.put(x1).put(x2)...put(xn).hash().asInt
+```
+
+This function produces the same result for a single-field struct as hashing the single value. Null input in the struct is ignored in the hash function.

Review Comment:
   I feel these can go in the notes.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org