advancedxy commented on code in PR #8579:
URL: https://github.com/apache/iceberg/pull/8579#discussion_r1448572967
##########
format/spec.md:
##########
@@ -1043,21 +1059,29 @@ Partition specs are serialized as a JSON object with the following fields:
 Each partition field in the fields list is stored as an object. See the table for more detail:
-|Transform or Field|JSON representation|Example|
-|--- |--- |--- |
-|**`identity`**|`JSON string: "identity"`|`"identity"`|
-|**`bucket[N]`**|`JSON string: "bucket[<N>]"`|`"bucket[16]"`|
-|**`truncate[W]`**|`JSON string: "truncate[<W>]"`|`"truncate[20]"`|
-|**`year`**|`JSON string: "year"`|`"year"`|
-|**`month`**|`JSON string: "month"`|`"month"`|
-|**`day`**|`JSON string: "day"`|`"day"`|
-|**`hour`**|`JSON string: "hour"`|`"hour"`|
-|**`Partition Field`**|`JSON object: {`<br /> `"source-id": <id int>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}`|`{`<br /> `"source-id": 1,`<br /> `"field-id": 1000,`<br /> `"name": "id_bucket",`<br /> `"transform": "bucket[16]"`<br />`}`|
+| Transform or Field | JSON representation | Example |
+|---|---|---|
+| **`identity`** | `JSON string: "identity"` | `"identity"` |
+| **`bucket[N]`** | `JSON string: "bucket[<N>]"` | `"bucket[16]"` |
+| **`bucket[N]`** (multi-arg bucket [1]) | `JSON string: "bucketV2[<N>]"` | `"bucketV2[16]"` |
+| **`truncate[W]`** | `JSON string: "truncate[<W>]"` | `"truncate[20]"` |
+| **`year`** | `JSON string: "year"` | `"year"` |
+| **`month`** | `JSON string: "month"` | `"month"` |
+| **`day`** | `JSON string: "day"` | `"day"` |
+| **`hour`** | `JSON string: "hour"` | `"hour"` |
+| **`Partition Field`** | `JSON object: {`<br /> `"source-id": <id int>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}` | `{`<br /> `"source-id": 1,`<br /> `"field-id": 1000,`<br /> `"name": "id_bucket",`<br /> `"transform": "bucket[16]"`<br />`}` |
+| **`Partition Field with multi-arg transform`** [2] | `JSON object: {`<br /> `"source-id": -1,`<br /> `"source-ids": <list of ids>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}` | `{`<br /> `"source-id": -1,`<br /> `"source-ids": [1,2],`<br /> `"field-id": 1000,`<br /> `"name": "id_type_bucket",`<br /> `"transform": "bucketV2[16]"`<br />`}` |
In some cases partition specs are stored using only the field list instead of
the object format that includes the spec ID, like the deprecated
`partition-spec` field in table metadata. The object format should be used
unless otherwise noted in this spec.
The `field-id` property was added for each partition field in v2. In v1, the
reference implementation assigned field ids sequentially in each spec starting
at 1,000. See Partition Evolution for more details.
+Notes:
+
+1. For multi-arg bucket, the serialized form is `bucketV2[N]` instead of `bucket[N]` to distinguish it from the single-arg bucket transform. Therefore, old readers/writers will identifier this transform as an unknown transform, old writer will stop writing the table if it encounters this transform, but old readers would still be able to read the table by scanning all the partitions.
+   This makes adding multi-arg transform a forward-compatible change, but not a backward-compatible change.
+2. For partition fields with multi-arg transform, `source-id` is replaced by `source-ids` and marked as `-1` to be consistent with single-arg transform. `source-id` should still be emitted for single-arg transform.
Review Comment:
I believe we have to omit `source-id` for multi-arg transforms. Otherwise,
older versions of Iceberg cannot identify a multi-arg transform as an unknown
transform.
Very old versions of `PartitionSpecParser` assume that `source-id` exists and
would throw an exception when reading tables with a multi-arg transform.
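
To make the second point concrete, here is a minimal Python sketch of a parser that treats `source-id` as mandatory. It is not Iceberg's Java `PartitionSpecParser`: the function name `old_parse_field`, the exception type, and the error message are all hypothetical and chosen only for illustration. The two JSON payloads reuse the example values from the table in the diff above.

```python
import json


def old_parse_field(field_json: str) -> dict:
    """Hypothetical stand-in for a parser that requires `source-id`.

    The exception type and message are assumptions for illustration,
    not copied from Iceberg.
    """
    obj = json.loads(field_json)
    if "source-id" not in obj:
        raise ValueError("Cannot parse missing int: source-id")
    return {
        "source_id": obj["source-id"],
        "name": obj["name"],
        "transform": obj["transform"],
    }


# Multi-arg partition field as proposed in the diff: `source-id` is kept as -1.
with_placeholder = json.dumps({
    "source-id": -1,
    "source-ids": [1, 2],
    "field-id": 1000,
    "name": "id_type_bucket",
    "transform": "bucketV2[16]",
})

# The same field with `source-id` left out entirely.
without_source_id = json.dumps({
    "source-ids": [1, 2],
    "field-id": 1000,
    "name": "id_type_bucket",
    "transform": "bucketV2[16]",
})

parsed = old_parse_field(with_placeholder)
print(parsed)

error_message = None
try:
    old_parse_field(without_source_id)
except ValueError as err:
    error_message = str(err)
    print("strict parser failed:", error_message)
```

In this sketch the payload without `source-id` fails before the transform string is ever inspected, which is the failure mode described for the very old parser. How a real reader binds a `source-id` of `-1` to the schema is outside what this sketch models.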