Re: [PR] Spec: Clarify multi-arg transform behavior for different versions [iceberg]

via GitHub Mon, 12 Feb 2024 22:08:29 -0800


szehon-ho commented on code in PR #9661:
URL: https://github.com/apache/iceberg/pull/9661#discussion_r1487243929



##########
format/spec.md:
##########
@@ -1130,14 +1142,10 @@ Each partition field in the fields list is stored as an 
object. See the table fo
 |**`hour`**|`JSON string: "hour"`|`"hour"`|
 |**`Partition Field`** [1,2]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": 
<id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br 
/>&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": 
<transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br 
/>&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br 
/>&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|
 
-In some cases partition specs are stored using only the field list instead of 
the object format that includes the spec ID, like the deprecated 
`partition-spec` field in table metadata. The object format should be used 
unless otherwise noted in this spec.
-
-The `field-id` property was added for each partition field in v2. In v1, the 
reference implementation assigned field ids sequentially in each spec starting 
at 1,000. See Partition Evolution for more details.
-
 Notes:
-
-1. For partition fields with a transform with a single argument, the ID of the 
source field is set on `source-id`, and `source-ids` is omitted.
-2. For partition fields with a transform of multiple arguments, the IDs of the 
source fields are set on `source-ids`. To preserve backward compatibility, 
`source-id` is set to -1.
+1. In some cases partition specs are stored using only the field list instead 
of the object format that includes the spec ID, like the deprecated 
`partition-spec` field in table metadata. The object format should be used 
unless otherwise noted in this spec.
+2. The `field-id` property was added for each partition field in v2. In v1, 
the reference implementation assigned field ids sequentially in each spec 
starting at 1,000. See Partition Evolution for more details.
+3. For tables of version < V3, the ID of the source field of each partition 
field is set in `source-id`. For tables of version >= V3, the ID(s) of the 
source field(s) is set on `source-ids`, and `source-id` is omitted. See 
Appendix E for more details.

Review Comment:
   @rdblue I added these paragraphs.
   
   I added some minor clarifications to the parts I had to read twice: I 
changed 'writers' to 'writers producing these transforms' and used 
'additionally' in the v1/v2 case to make it clearer that `source-id` is 
populated in addition to `source-ids`. Let me know if that sounds OK.
   
   
   > Transforms that accept multiple arguments specify source field IDs using 
`source-ids` instead of `source-id`. Writers producing these transforms in v1 
and v2 metadata should additionally produce the `source-id` field by setting it 
to the first ID from the `source-ids` list. Writers producing these transforms 
in v3 metadata should populate only the `source-ids` field because v3 readers 
will fully support multi-arg transforms by reading this field.
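   
   To make the wording above concrete, here is a minimal sketch in Python of 
what a writer might do. `partition_field_json` and its parameters are 
hypothetical names for illustration, not part of the reference implementation. 
The single-argument branch is my assumption: the paragraph only says multi-arg 
transforms use `source-ids` instead of `source-id`.

```python
def partition_field_json(name, field_id, transform, source_ids, format_version):
    """Hypothetical helper: build the JSON object for one partition field."""
    if not source_ids:
        raise ValueError("a partition field needs at least one source field ID")

    field = {"field-id": field_id, "name": name, "transform": transform}

    if len(source_ids) == 1:
        # Assumption: single-argument transforms keep using `source-id`.
        field["source-id"] = source_ids[0]
    else:
        # Multi-argument transforms list every source field in `source-ids`.
        field["source-ids"] = list(source_ids)
        if format_version < 3:
            # v1/v2: additionally write `source-id` as the first source ID.
            field["source-id"] = source_ids[0]

    return field
```

   With this sketch, a multi-arg field written for v2 carries both `source-ids` 
and `source-id`, while the same field written for v3 carries only `source-ids`.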
   
   
   This sentence actually confused me a bit:
   > Older versions of the reference implementation can read tables with 
unknown transforms and will ignore multi-arg transforms, but other 
implementations may break if they encounter unknown transform names.
   
   I was thinking of pulling this sentence out, since it seems to be a more 
general statement and the paragraph flows better without it. I was also 
thinking it may make sense to say this for all unknown transforms, without 
mentioning multi-arg in particular, something like:
   
   > Older versions of the reference implementation can read tables with 
transforms unknown to them, but without the ability to push down filters or 
write. Other implementations, however, may break if they encounter unknown 
transforms.
   
   What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


