aihuaxu commented on code in PR #10831:
URL: https://github.com/apache/iceberg/pull/10831#discussion_r1885094847

##########
format/spec.md:
##########
@@ -182,6 +182,21 @@ A **`list`** is a collection of values with some element type. The element field
 A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.

+#### Semi-structured Types
+
+A **`variant`** is a value that stores semi-structured data. The structure and data types in a variant are not necessarily consistent across rows in a table or data file. The variant type and binary encoding are defined in the [Parquet project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md). Support for Variant is added in Iceberg v3.
+
+Variants are similar to JSON with a wider set of primitive values including date, timestamp, timestamptz, binary, and floating points.
+
+Variant values may contain nested types:
+1. An array is an ordered collection of variant values.
+2. An object is a collection of fields that are a string key and a variant value.
+
+As a semi-structured type, there are important differences between variant and Iceberg's other types:
+1. Variant arrays are similar to lists, but may contain any variant value rather than a fixed element type.
+2. Variant objects are similar to structs, but may contain variable fields identified by name and field values may be any variant value rather than a fixed field type.
+3. Variant primitives are narrower than Iceberg's primitive types: time, timestamp_ns, timestamptz_ns, uuid, and fixed(L) are not supported.

Review Comment:
Yes. We can use string instead. Also, if we want to support fixed(L), we need to think about how to represent the length L. To avoid losing that information we would need to encode L in the type, but our type field only has 5 bits.
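
A minimal sketch of the concern above, assuming the 5-bit type field mentioned in the comment. The constant `TYPE_ID_BITS` and both helper functions are hypothetical and are not taken from the Variant encoding spec. The sketch only shows that a 5-bit field has 32 distinct ids, so one id per possible L is not feasible, and that one possible alternative is to carry the length next to the value:

```python
# Assumption: the type id occupies 5 bits, as stated in the review comment.
TYPE_ID_BITS = 5


def max_type_ids(bits: int) -> int:
    """Number of distinct type ids that fit in a field of the given width."""
    return 1 << bits


def encode_with_length_prefix(payload: bytes) -> bytes:
    """Hypothetical layout: 4-byte little-endian length, then the raw bytes.

    The length L travels with each value instead of living in the type id.
    """
    return len(payload).to_bytes(4, "little") + payload


def decode_with_length_prefix(buf: bytes) -> bytes:
    """Inverse of encode_with_length_prefix."""
    length = int.from_bytes(buf[:4], "little")
    return buf[4:4 + length]


if __name__ == "__main__":
    print(max_type_ids(TYPE_ID_BITS))
    encoded = encode_with_length_prefix(b"\x01\x02\x03")
    print(decode_with_length_prefix(encoded))
```

With a per-value length prefix, the value round-trips, but nothing in the type records that every value must have the same length L. That is the information the comment says would be lost.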

##########
format/spec.md:
##########
@@ -1208,6 +1224,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo
 | **`struct`** | `group` | | |
 | **`list`** | `3-level list` | `LIST` | See Parquet docs for 3-level representation. |
 | **`map`** | `3-level map` | `MAP` | See Parquet docs for 3-level representation. |
+| **`variant`** | `group` with `metadata` and `value` fields. `metadata` and `value` must not be assigned field IDs and the fields are accessed through names.| `VARIANT` | See Parquet docs for Variant encoding and Variant shredding encoding. |

Review Comment:
Sure. There is a discussion above about whether I should link to the files on the main branch or at a particular commit. For now, I will link to the files at a specific commit to reflect the current state.