sfc-gh-bhannel commented on code in PR #12644: URL: https://github.com/apache/iceberg/pull/12644#discussion_r2014937438
########## format/spec.md: ########## @@ -1414,12 +1414,16 @@ Each partition field in `fields` is stored as a JSON object with the following p | V1 | V2 | V3 | Field | JSON representation | Example | |----------|----------|----------|------------------|---------------------|--------------| -| required | required | omitted | **`source-id`** | `JSON int` | 1 | -| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` | +| required | required | required¹ | **`source-id`** | `JSON int` | 1 | +| | | required¹ | **`source-ids`** | `JSON list of ints` | `[1,2]` | | | required | required | **`field-id`** | `JSON int` | 1000 | | required | required | required | **`name`** | `JSON string` | `id_bucket` | | required | required | required | **`transform`** | `JSON string` | `bucket[16]` | +Notes: + +1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted. Review Comment: Would it be more consistent to always use source-ids for v3 writers? Or do you want to use source-id to allow v3 writers to be compatible with v2 readers if you don't use v3 features? ########## format/spec.md: ########## @@ -1453,13 +1457,15 @@ Each sort field in the fields list is stored as an object with the following pro | V1 | V2 | V3 | Field | JSON representation | Example | |----------|----------|----------|------------------|---------------------|-------------| -| required | required | required | **`transform`** | `JSON string` | `bucket[4]` | -| required | required | omitted | **`source-id`** | `JSON int` | 1 | +| required | required | required¹ | **`transform`** | `JSON string` | `bucket[4]` | +| required | required | required¹ | **`source-id`** | `JSON int` | 1 | | | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` | | required | required | required | **`direction`** | `JSON string` | `asc` | | required | required | required | **`null-order`** | `JSON string` | `nulls-last`| -In v3 metadata, writers must use only `source-ids` because v3 requires reader support for multi-arg transforms. +Notes: + +1. For sort fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted. Review Comment: Did you mean to make the `transform` row refer to this footnote? I would expect it to be the `source-id` and `source-ids` rows. ########## format/spec.md: ########## @@ -540,7 +540,7 @@ Notes: 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters. 3. Strings are truncated to a valid UTF-8 string with no more than `L` code points. 4. In contrast to strings, binary values do not have an assumed encoding and are truncated to `L` bytes. - +5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`. Review Comment: I think it would be good to be explicit about how to handle nulls and dropped columns. Should it be calculated as if `hash(null) == 0`? It also might be worth adding parentheses for those who don't recall the precedence order between XOR and modulo. Also - good to meet you, and thanks for working on this! I'm a developer at Snowflake working on Iceberg functionality, and we really appreciate how detailed the spec is! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org