Re: [PR] Spec: Allow the use of `source-id` in V3 [iceberg]

via GitHub Wed, 26 Mar 2025 18:04:28 -0700


sfc-gh-bhannel commented on code in PR #12644:
URL: https://github.com/apache/iceberg/pull/12644#discussion_r2014937438



##########
format/spec.md:
##########
@@ -1414,12 +1414,16 @@ Each partition field in `fields` is stored as a JSON 
object with the following p
 
 | V1       | V2       | V3       | Field            | JSON representation | 
Example      |
 
|----------|----------|----------|------------------|---------------------|--------------|
-| required | required | omitted  | **`source-id`**  | `JSON int`          | 1  
          |
-|          |          | required | **`source-ids`** | `JSON list of ints` | 
`[1,2]`      |
+| required | required | required¹ | **`source-id`**  | `JSON int`          | 1 
           |
+|          |          | required¹ | **`source-ids`** | `JSON list of ints` | 
`[1,2]`      |
 |          | required | required | **`field-id`**   | `JSON int`          | 
1000         |
 | required | required | required | **`name`**       | `JSON string`       | 
`id_bucket`  |
 | required | required | required | **`transform`**  | `JSON string`       | 
`bucket[16]` |
 
+Notes:
+
+1. For partition fields with a transform with a single argument, the ID of the 
source field is set on `source-id`, and `source-ids` is omitted.

Review Comment:
   Would it be more consistent to always use source-ids for v3 writers? Or do 
you want to use source-id to allow v3 writers to be compatible with v2 readers 
if you don't use v3 features?



##########
format/spec.md:
##########
@@ -1453,13 +1457,15 @@ Each sort field in the fields list is stored as an 
object with the following pro
 
 | V1       | V2       | V3       | Field            | JSON representation | 
Example     |
 
|----------|----------|----------|------------------|---------------------|-------------|
-| required | required | required | **`transform`**  | `JSON string`       | 
`bucket[4]` |
-| required | required | omitted  | **`source-id`**  | `JSON int`          | 1  
         |
+| required | required | required¹ | **`transform`**  | `JSON string`       | 
`bucket[4]` |
+| required | required | required¹ | **`source-id`**  | `JSON int`          | 1 
          |
 |          |          | required | **`source-ids`** | `JSON list of ints` | 
`[1,2]`     |
 | required | required | required | **`direction`**  | `JSON string`       | 
`asc`       |
 | required | required | required | **`null-order`** | `JSON string`       | 
`nulls-last`|
 
-In v3 metadata, writers must use only `source-ids` because v3 requires reader 
support for multi-arg transforms.
+Notes:
+
+1. For sort fields with a transform with a single argument, the ID of the 
source field is set on `source-id`, and `source-ids` is omitted.

Review Comment:
   Did you mean to make the `transform` row refer to this footnote? I would 
expect it to be the `source-id` and `source-ids` rows.



##########
format/spec.md:
##########
@@ -540,7 +540,7 @@ Notes:
 2. The width, `W`, used to truncate decimal values is applied using the scale 
of the decimal column to avoid additional (and potentially conflicting) 
parameters.
 3. Strings are truncated to a valid UTF-8 string with no more than `L` code 
points.
 4. In contrast to strings, binary values do not have an assumed encoding and 
are truncated to `L` bytes.
-
+5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕ 
hash(col2) ⊕ ... ⊕ hash(colN)) % W`.

Review Comment:
   I think it would be good to be explicit about how to handle nulls and 
dropped columns. Should it be calculated as if `hash(null) == 0`?
   
   It also might be worth adding parentheses for those who don't recall the 
precedence order between XOR and modulo.
   
   Also - good to meet you, and thanks for working on this! I'm a developer at 
Snowflake working on Iceberg functionality, and we really appreciate how 
detailed the spec is!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spec: Allow the use of `source-id` in V3 [iceberg]

Reply via email to