szehon-ho commented on code in PR #8579:
URL: https://github.com/apache/iceberg/pull/8579#discussion_r1450615797
##########
format/spec.md:
##########
@@ -329,19 +329,35 @@ The `void` transform may be used to replace the transform in an existing partiti
 
 #### Bucket Transform Details
 
-Bucket partition transforms use a 32-bit hash of the source value. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
+Bucket partition transforms use a 32-bit hash of the source value(s). The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
 
 Transforms are parameterized by a number of buckets [1], `N`. The hash mod `N` must produce a positive value by first discarding the sign bit of the hash value. In pseudo-code, the function is:
 
 ```
   def bucket_N(x) = (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N
 ```
 
+When bucket transform is applied on a list of values, the input is treated as concatenated bytes of each value. In pseudo-code, the function is:
+
+```
+  def murmur3_x86_32_hashes(x1, x2, x3, ...) = {
+    byte[] bytes;
+    for (x in [x1, x2, x3, ...]) {
+      bytes = x == null ? bytes : append(bytes, bytesOf(x));

Review Comment:
   What do you think of something like:
   ```
   if (x != null) bytes.append(bytesOf(x))
   ```
   Using the original `append` form is also fine; I was just thinking to prevent the `=` and `==` from sitting so close to each other.

##########
format/spec.md:
##########
@@ -1060,6 +1076,14 @@ The types below are not currently valid for bucketing, and so are not hashed. Ho
 | **`float`** | `hashLong(doubleToLongBits(double(v))` [4] | `1.0F` → `-142385009`, `0.0F` → `1669671676`, `-0.0F` → `1669671676` |
 | **`double`** | `hashLong(doubleToLongBits(v))` [4] | `1.0D` → `-142385009`, `0.0D` → `1669671676`, `-0.0D` → `1669671676` |
 
+For the multi-arg hash function, the hash value is the result of `hashBytes` on the concatenation of the bytes of the arguments. NULL input is ignored.
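[Editor's sketch] The hashing described in the hunks above can be checked with a small, unofficial Python sketch. This is not Iceberg code; the helper names (`bytes_of`, `bucket_N`, `murmur3_x86_32_hashes`) are illustrative, covering only `long` and `string` inputs, and the results are compared against the hash test vectors published in the spec's appendix (`34` → `2017239379`, `iceberg` → `1210000089`):

```python
import struct

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 hash, x86 variant, returning a signed 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data)
    # body: process 4-byte little-endian blocks
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # tail: remaining 1-3 bytes, accumulated little-endian
    k = 0
    tail = data[n - n % 4:]
    for i in reversed(range(len(tail))):
        k = (k << 8) | tail[i]
    if tail:
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # finalization mix
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h - (1 << 32) if h & 0x80000000 else h

def bytes_of(v) -> bytes:
    # assumed byte representations from the spec's hash table:
    # int/long use littleEndianBytes(long(v)); string uses utf8Bytes(v)
    if isinstance(v, int):
        return struct.pack("<q", v)
    if isinstance(v, str):
        return v.encode("utf-8")
    raise TypeError(f"unsupported type for this sketch: {type(v)}")

def murmur3_x86_32_hashes(*args) -> int:
    # multi-arg form: hash the concatenated bytes, skipping NULL (None) inputs
    return murmur3_x86_32(b"".join(bytes_of(a) for a in args if a is not None))

def bucket_N(v, n: int) -> int:
    # discard the sign bit before mod, as in the spec's bucket_N pseudo-code
    return (murmur3_x86_32(bytes_of(v)) & 0x7FFFFFFF) % n
```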
+
+| Struct examples | Hash Specification | Test value |
+|---|---|---|

Review Comment:
   I feel like if a table is in an appendix, it should be complete (cover all types). What do you think about something like the following?
   
   For multiple arguments, `hashBytes()` is applied to the concatenated byte representation of each argument:
   
   | primitive type | byte representation |
   |---|---|
   | int | littleEndianBytes(long(v)) |
   | ... | ... |
   
   For example, the hash of `(a: int, b: string)` would be `hashBytes(concatenation(littleEndianBytes(long(a)), utf8Bytes(b)))`

##########
format/spec.md:
##########
@@ -329,19 +329,35 @@ The `void` transform may be used to replace the transform in an existing partiti
 
 #### Bucket Transform Details
 
-Bucket partition transforms use a 32-bit hash of the source value. The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
+Bucket partition transforms use a 32-bit hash of the source value(s). The 32-bit hash implementation is the 32-bit Murmur3 hash, x86 variant, seeded with 0.
 
 Transforms are parameterized by a number of buckets [1], `N`. The hash mod `N` must produce a positive value by first discarding the sign bit of the hash value. In pseudo-code, the function is:
 
 ```
   def bucket_N(x) = (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N
 ```
 
+When bucket transform is applied on a list of values, the input is treated as concatenated bytes of each value. In pseudo-code, the function is:

Review Comment:
   What do you think about "the hash is applied to the concatenated byte representations of all values"?

##########
format/spec.md:
##########
@@ -1043,21 +1059,29 @@ Partition specs are serialized as a JSON object with the following fields:
 
 Each partition field in the fields list is stored as an object.
 See the table for more detail:
 
-|Transform or Field|JSON representation|Example|
-|--- |--- |--- |
-|**`identity`**|`JSON string: "identity"`|`"identity"`|
-|**`bucket[N]`**|`JSON string: "bucket[<N>]"`|`"bucket[16]"`|
-|**`truncate[W]`**|`JSON string: "truncate[<W>]"`|`"truncate[20]"`|
-|**`year`**|`JSON string: "year"`|`"year"`|
-|**`month`**|`JSON string: "month"`|`"month"`|
-|**`day`**|`JSON string: "day"`|`"day"`|
-|**`hour`**|`JSON string: "hour"`|`"hour"`|
-|**`Partition Field`**|`JSON object: {`<br /> `"source-id": <id int>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}`|`{`<br /> `"source-id": 1,`<br /> `"field-id": 1000,`<br /> `"name": "id_bucket",`<br /> `"transform": "bucket[16]"`<br />`}`|
+| Transform or Field | JSON representation | Example |
+|---|---|---|
+| **`identity`** | `JSON string: "identity"` | `"identity"` |
+| **`bucket[N]`** | `JSON string: "bucket[<N>]"` | `"bucket[16]"` |
+| **`bucket[N]`** (multi-arg bucket [1]) | `JSON string: "bucketV2[<N>]"` | `"bucketV2[16]"` |
+| **`truncate[W]`** | `JSON string: "truncate[<W>]"` | `"truncate[20]"` |
+| **`year`** | `JSON string: "year"` | `"year"` |
+| **`month`** | `JSON string: "month"` | `"month"` |
+| **`day`** | `JSON string: "day"` | `"day"` |
+| **`hour`** | `JSON string: "hour"` | `"hour"` |
+| **`Partition Field`** | `JSON object: {`<br /> `"source-id": <id int>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}` | `{`<br /> `"source-id": 1,`<br /> `"field-id": 1000,`<br /> `"name": "id_bucket",`<br /> `"transform": "bucket[16]"`<br />`}` |
+| **`Partition Field with multi-arg transform`** [2] | `JSON object: {`<br /> `"source-id": -1,`<br /> `"source-ids": <list of ids>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}` | `{`<br /> `"source-id": -1,`<br /> `"source-ids": [1,2],`<br /> `"field-id": 1000,`<br /> `"name": "id_type_bucket",`<br /> `"transform": "bucketV2[16]"`<br />`}` |
 
 In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec.
 
 The `field-id` property was added for each partition field in v2. In v1, the reference implementation assigned field ids sequentially in each spec starting at 1,000. See Partition Evolution for more details.
 
+Notes:
+
+1. For multi-arg bucket, the serialized form is `bucketV2[N]` instead of `bucket[N]` to distinguish it from the single-arg bucket transform. Old readers and writers will therefore treat this transform as an unknown transform: an old writer will stop writing the table when it encounters it, but an old reader can still read the table by scanning all partitions.
+   This makes adding a multi-arg transform a forward-compatible change, but not a backward-compatible change.
+2. For partition fields with a multi-arg transform, `source-id` is replaced by `source-ids` and marked as `-1` to be consistent with the single-arg transform. `source-id` should still be emitted for single-arg transforms.

Review Comment:
   Hm, I see. Do you think this is clearer?
   
   1. For partition fields with a transform of multiple arguments, the ids of the source fields are set on `source-ids`.
      To preserve backward compatibility, `source-id` is set to -1.
   2. For partition fields with a transform of a single argument, the id of the source field is set on `source-id`, and `source-ids` is omitted.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
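[Editor's sketch] The `source-id` / `source-ids` convention discussed in the notes above can be illustrated with a short Python sketch. `partition_field_json` is a hypothetical helper invented for this illustration, not an Iceberg API; it only shows how the two fields would be emitted under the proposed convention:

```python
# Hypothetical helper illustrating the serialization convention under review:
# multi-arg transforms emit "source-ids" plus a -1 "source-id" sentinel so
# old readers see a well-formed (if unknown) field; single-arg transforms
# emit "source-id" only and omit "source-ids".
def partition_field_json(field_id, name, transform, source_ids):
    field = {"field-id": field_id, "name": name, "transform": transform}
    if len(source_ids) == 1:
        field["source-id"] = source_ids[0]   # single-arg: "source-ids" omitted
    else:
        field["source-id"] = -1              # sentinel for old readers
        field["source-ids"] = list(source_ids)
    return field
```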