Re: [PR] Spec: Clarify multi-arg transform behavior for different versions [iceberg]

via GitHub Mon, 26 Feb 2024 14:39:24 -0800


szehon-ho commented on code in PR #9661:
URL: https://github.com/apache/iceberg/pull/9661#discussion_r1503400947



##########
format/spec.md:
##########
@@ -1134,10 +1148,9 @@ In some cases partition specs are stored using only the 
field list instead of th
 
 The `field-id` property was added for each partition field in v2. In v1, the 
reference implementation assigned field ids sequentially in each spec starting 
at 1,000. See Partition Evolution for more details.
 
-Notes:
+Transforms that accept multiple arguments specify source field IDs using 
`source-ids` instead of `source-id`. Writers producing these transforms in v1 
and v2 metadata should additionally produce the `source-id` field by setting it 
to the first ID from the `source-ids` list. Writers producing these transforms 
in v3 metadata should populate only the `source-ids` field because v3 readers 
will fully-support multi-arg transforms by reading this field.

Review Comment:
   Thanks, it is more clear!



##########
format/spec.md:
##########
@@ -1314,6 +1331,24 @@ Default values are added to struct fields in v3.
 
 Types `timestamp_ns` and `timestamptz_ns` are added in v3.
 
+Writing V3 metadata:
+
+* Partition field JSON:
+    * `source-ids` was added and is required
+    * `source-id` is no longer required and should be omitted; always use 
`source-ids` instead
+* Sort order JSON:
+    * `source-ids` was added and is required
+    * `source-id` is no longer required and should be omitted; always use 
`source-ids` instead
+
+Reading older version metadata for V3:
+
+* Partition field and sort order field `source-ids` must default to a 
single-value list of the value of `source-id`
+
+Writing older version metadata:
+
+* For single-arg transforms, partition field and sort order field `source-id` 
should be written; `source-ids` must be omitted

Review Comment:
   Yea you guys are right, it is slightly clearer to write 'source-ids' as a 
single element list for single-arg transform on v1/v2 tables, changed.



##########
format/spec.md:
##########
@@ -301,7 +303,7 @@ Tables are configured with a **partition spec** that 
defines how to produce a tu
 *   A **transform** that is applied to the source column(s) to produce a 
partition value
 *   A **partition name**
 
-The source column, selected by id, must be a primitive type and cannot be 
contained in a map or list, but may be nested in a struct. For details on how 
to serialize a partition spec to JSON, see Appendix C.
+The source columns, selected by ids, must be a primitive type and cannot be 
contained in a map or list, but may be nested in a struct. For details on how 
to serialize a partition spec to JSON, see Appendix C.

Review Comment:
   Makes sense.  iiuc, implementations would optionally support them in v1/v2 
based on a flag, and required to support them in v3.



##########
format/spec.md:
##########
@@ -1314,6 +1330,24 @@ Default values are added to struct fields in v3.
 
 Types `timestamp_ns` and `timestamptz_ns` are added in v3.
 
+Writing V3 metadata:
+
+* Partition field JSON:
+    * `source-ids` was added and is required
+    * `source-id` is no longer required and should be omitted; always use 
`source-ids` instead
+* Sort order JSON:
+    * `source-ids` was added and is required
+    * `source-id` is no longer required and should be omitted; always use 
`source-ids` instead
+
+Reading older version metadata for V3:
+
+* Partition field and sort order field `source-ids` must default to a 
single-value list of the value of `source-id`
+
+Writing older version metadata:

Review Comment:
   Yes, fixed.



##########
format/spec.md:
##########
@@ -1128,16 +1142,14 @@ Each partition field in the fields list is stored as an 
object. See the table fo
 |**`month`**|`JSON string: "month"`|`"month"`|
 |**`day`**|`JSON string: "day"`|`"day"`|
 |**`hour`**|`JSON string: "hour"`|`"hour"`|
-|**`Partition Field`** [1,2]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": 
<id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br 
/>&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": 
<transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br 
/>&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br 
/>&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|
 
 In some cases partition specs are stored using only the field list instead of 
the object format that includes the spec ID, like the deprecated 
`partition-spec` field in table metadata. The object format should be used 
unless otherwise noted in this spec.
 
 The `field-id` property was added for each partition field in v2. In v1, the 
reference implementation assigned field ids sequentially in each spec starting 
at 1,000. See Partition Evolution for more details.
 
-Notes:
+Transforms that accept multiple arguments specify source field IDs using 
`source-ids` instead of `source-id`. Writers producing these transforms in v1 
and v2 metadata should additionally produce the `source-id` field by setting it 
to the first ID from the `source-ids` list. Writers producing these transforms 
in v3 metadata should populate only the `source-ids` field because v3 readers 
will fully-support multi-arg transforms by reading this field.
 
-1. For partition fields with a transform with a single argument, the ID of the 
source field is set on `source-id`, and `source-ids` is omitted.
-2. For partition fields with a transform of multiple arguments, the IDs of the 
source fields are set on `source-ids`. To preserve backward compatibility, 
`source-id` is set to -1.
+Older versions of the reference implementation can read tables with transforms 
unknown to it, without the ability to push down filters or write. But other 
implementations may break if they encounter unknown transforms.

Review Comment:
   Done, added here and also in Appendix E



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spec: Clarify multi-arg transform behavior for different versions [iceberg]

Reply via email to