rdblue commented on code in PR #10831: URL: https://github.com/apache/iceberg/pull/10831#discussion_r1815771516
########## format/spec.md: ########## @@ -1297,54 +1308,56 @@ Example This serialization scheme is for storing single values as individual binary values in the lower and upper bounds maps of manifest files. -| Type | Binary serialization | -|------------------------------|--------------------------------------------------------------------------------------------------------------| -| **`unknown`** | Not supported | -| **`boolean`** | `0x00` for false, non-zero byte for true | -| **`int`** | Stored as 4-byte little-endian | -| **`long`** | Stored as 8-byte little-endian | -| **`float`** | Stored as 4-byte little-endian | -| **`double`** | Stored as 8-byte little-endian | -| **`date`** | Stores days from the 1970-01-01 in an 4-byte little-endian int | -| **`time`** | Stores microseconds from midnight in an 8-byte little-endian long | -| **`timestamp`** | Stores microseconds from 1970-01-01 00:00:00.000000 in an 8-byte little-endian long | -| **`timestamptz`** | Stores microseconds from 1970-01-01 00:00:00.000000 UTC in an 8-byte little-endian long | -| **`timestamp_ns`** | Stores nanoseconds from 1970-01-01 00:00:00.000000000 in an 8-byte little-endian long | -| **`timestamptz_ns`** | Stores nanoseconds from 1970-01-01 00:00:00.000000000 UTC in an 8-byte little-endian long | -| **`string`** | UTF-8 bytes (without length) | -| **`uuid`** | 16-byte big-endian value, see example in Appendix B | -| **`fixed(L)`** | Binary value | -| **`binary`** | Binary value (without length) | -| **`decimal(P, S)`** | Stores unscaled value as two’s-complement big-endian binary, using the minimum number of bytes for the value | -| **`struct`** | Not supported | -| **`list`** | Not supported | -| **`map`** | Not supported | +| Type | Binary serialization | +|----------------------|--------------------------------------------------------------------------------------------------------------| +| **`unknown`** | Not supported | +| **`boolean`** | `0x00` for false, non-zero byte for true | +| **`int`** | Stored as 4-byte little-endian | +| **`long`** | Stored as 8-byte little-endian | +| **`float`** | Stored as 4-byte little-endian | +| **`double`** | Stored as 8-byte little-endian | +| **`date`** | Stores days from the 1970-01-01 in an 4-byte little-endian int | +| **`time`** | Stores microseconds from midnight in an 8-byte little-endian long | +| **`timestamp`** | Stores microseconds from 1970-01-01 00:00:00.000000 in an 8-byte little-endian long | +| **`timestamptz`** | Stores microseconds from 1970-01-01 00:00:00.000000 UTC in an 8-byte little-endian long | +| **`timestamp_ns`** | Stores nanoseconds from 1970-01-01 00:00:00.000000000 in an 8-byte little-endian long | +| **`timestamptz_ns`** | Stores nanoseconds from 1970-01-01 00:00:00.000000000 UTC in an 8-byte little-endian long | +| **`string`** | UTF-8 bytes (without length) | +| **`uuid`** | 16-byte big-endian value, see example in Appendix B | +| **`fixed(L)`** | Binary value | +| **`binary`** | Binary value (without length) | +| **`decimal(P, S)`** | Stores unscaled value as two’s-complement big-endian binary, using the minimum number of bytes for the value | +| **`struct`** | Not supported | +| **`list`** | Not supported | +| **`map`** | Not supported | +| **`variant`** | Not supported | ### JSON single-value serialization Single values are serialized as JSON by type according to the following table: -| Type | JSON representation | Example | Description | -| ------------------ | ----------------------------------------- | ------------------------------------------ | -- | -| **`boolean`** | **`JSON boolean`** | `true` | | -| **`int`** | **`JSON int`** | `34` | | -| **`long`** | **`JSON long`** | `34` | | -| **`float`** | **`JSON number`** | `1.0` | | -| **`double`** | **`JSON number`** | `1.0` | | -| **`decimal(P,S)`** | **`JSON string`** | `"14.20"`, `"2E+20"` | Stores the string representation of the decimal value, specifically, for values with a positive scale, the number of digits to the right of the decimal point is used to indicate scale, for values with a negative scale, the scientific notation is used and the exponent must equal the negated scale | -| **`date`** | **`JSON string`** | `"2017-11-16"` | Stores ISO-8601 standard date | -| **`time`** | **`JSON string`** | `"22:31:08.123456"` | Stores ISO-8601 standard time with microsecond precision | -| **`timestamp`** | **`JSON string`** | `"2017-11-16T22:31:08.123456"` | Stores ISO-8601 standard timestamp with microsecond precision; must not include a zone offset | -| **`timestamptz`** | **`JSON string`** | `"2017-11-16T22:31:08.123456+00:00"` | Stores ISO-8601 standard timestamp with microsecond precision; must include a zone offset and it must be '+00:00' | -| **`timestamp_ns`** | **`JSON string`** | `"2017-11-16T22:31:08.123456789"` | Stores ISO-8601 standard timestamp with nanosecond precision; must not include a zone offset | -| **`timestamptz_ns`** | **`JSON string`** | `"2017-11-16T22:31:08.123456789+00:00"` | Stores ISO-8601 standard timestamp with nanosecond precision; must include a zone offset and it must be '+00:00' | -| **`string`** | **`JSON string`** | `"iceberg"` | | -| **`uuid`** | **`JSON string`** | `"f79c3e09-677c-4bbd-a479-3f349cb785e7"` | Stores the lowercase uuid string | -| **`fixed(L)`** | **`JSON string`** | `"000102ff"` | Stored as a hexadecimal string | -| **`binary`** | **`JSON string`** | `"000102ff"` | Stored as a hexadecimal string | -| **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format | -| **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format | -| **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format | +| Type | JSON representation | Example | Description | +|----------------------|--------------------------------------------------------------|--------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **`boolean`** | **`JSON boolean`** | `true` | | +| **`int`** | **`JSON int`** | `34` | | +| **`long`** | **`JSON long`** | `34` | | +| **`float`** | **`JSON number`** | `1.0` | | +| **`double`** | **`JSON number`** | `1.0` | | +| **`decimal(P,S)`** | **`JSON string`** | `"14.20"`, `"2E+20"` | Stores the string representation of the decimal value, specifically, for values with a positive scale, the number of digits to the right of the decimal point is used to indicate scale, for values with a negative scale, the scientific notation is used and the exponent must equal the negated scale | +| **`date`** | **`JSON string`** | `"2017-11-16"` | Stores ISO-8601 standard date | +| **`time`** | **`JSON string`** | `"22:31:08.123456"` | Stores ISO-8601 standard time with microsecond precision | +| **`timestamp`** | **`JSON string`** | `"2017-11-16T22:31:08.123456"` | Stores ISO-8601 standard timestamp with microsecond precision; must not include a zone offset | +| **`timestamptz`** | **`JSON string`** | `"2017-11-16T22:31:08.123456+00:00"` | Stores ISO-8601 standard timestamp with microsecond precision; must include a zone offset and it must be '+00:00' | +| **`timestamp_ns`** | **`JSON string`** | `"2017-11-16T22:31:08.123456789"` | Stores ISO-8601 standard timestamp with nanosecond precision; must not include a zone offset | +| **`timestamptz_ns`** | **`JSON string`** | `"2017-11-16T22:31:08.123456789+00:00"` | Stores ISO-8601 standard timestamp with nanosecond precision; must include a zone offset and it must be '+00:00' | +| **`string`** | **`JSON string`** | `"iceberg"` | | +| **`uuid`** | **`JSON string`** | `"f79c3e09-677c-4bbd-a479-3f349cb785e7"` | Stores the lowercase uuid string | +| **`fixed(L)`** | **`JSON string`** | `"000102ff"` | Stored as a hexadecimal string | +| **`binary`** | **`JSON string`** | `"000102ff"` | Stored as a hexadecimal string | +| **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format | +| **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format | +| **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format | +| **`variant`** | **`Same JSON representation in this table for stored type`** | `null`, `true`, `{"1": 1, "2": "bar"}` | The JSON representation matches the format shown in this table for the type stored in the Variant. | Review Comment: I think this is insufficient because it loses type information. For example, if there is a Variant that contains a timestamp, the timestamp's type is lost. This also appears to use the struct representation for an object, but a Variant has no field IDs. I think that we probably do want to use a JSON representation, but we will need to make sure that it aligns with the JSON conversion defined in the Variant spec (https://github.com/apache/parquet-format/pull/461/files#diff-80a56bf0d841087ab5038020e4b78119d6ceb44684719ba4d3a6e22effb36eb9R459) and that we have defined requirements for recovering the types. Note that the JSON conversion in the Variant spec differs from this section: * Binary values are hexadecimal strings here (:facepalm:) and base64 in Variant * Decimal values are strings here and numbers in Variant We may want to not allow Variant in JSON because of the type loss. Or we may want to specify a subset that can be recovered, like booleans, integers, floats, strings, and arrays. There are two places where this section is used: in the REST protocol for sending lower/upper bounds and in default values. We don't really need default values for variants (but would have to disallow them) and lower/upper bounds would generally work with a subset. The other option is to encode `variant` as a base64 binary value. That is a bit ugly in JSON, but it is not lossy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org