aihuaxu commented on code in PR #10831:
URL: https://github.com/apache/iceberg/pull/10831#discussion_r1874623947


##########
format/spec.md:
##########
@@ -178,6 +178,21 @@ A **`list`** is a collection of values with some element 
type. The element field
 
 A **`map`** is a collection of key-value pairs with a key type and a value 
type. Both the key field and value field each have an integer id that is unique 
in the table schema. Map keys are required and map values can be either 
optional or required. Both map keys and map values may be any type, including 
nested types.
 
+#### Semi-structured Types
+
+A **`variant`** is a value that stores semi-structured data. The structure and 
data types in a variant are not necessarily consistent across rows in a table 
or data file. The variant type and binary encoding are defined in the [Parquet 
project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md).
 Support for Variant is added in Iceberg v3.

Review Comment:
   I think we don't want to duplicate the content of the actual spec in Parquet. 
Basically, what is mentioned in the Parquet spec should be included.  



##########
format/spec.md:
##########
@@ -182,6 +182,21 @@ A **`list`** is a collection of values with some element 
type. The element field
 
 A **`map`** is a collection of key-value pairs with a key type and a value 
type. Both the key field and value field each have an integer id that is unique 
in the table schema. Map keys are required and map values can be either 
optional or required. Both map keys and map values may be any type, including 
nested types.
 
+#### Semi-structured Types
+
+A **`variant`** is a value that stores semi-structured data. The structure and 
data types in a variant are not necessarily consistent across rows in a table 
or data file. The variant type and binary encoding are defined in the [Parquet 
project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md).
 Support for Variant is added in Iceberg v3.
+
+Variants are similar to JSON with a wider set of primitive values including 
date, timestamp, timestamptz, binary, and floating points.

Review Comment:
   Updated to add decimals and remove floats. 



##########
format/spec.md:
##########
@@ -182,6 +182,21 @@ A **`list`** is a collection of values with some element 
type. The element field
 
 A **`map`** is a collection of key-value pairs with a key type and a value 
type. Both the key field and value field each have an integer id that is unique 
in the table schema. Map keys are required and map values can be either 
optional or required. Both map keys and map values may be any type, including 
nested types.
 
+#### Semi-structured Types
+
+A **`variant`** is a value that stores semi-structured data. The structure and 
data types in a variant are not necessarily consistent across rows in a table 
or data file. The variant type and binary encoding are defined in the [Parquet 
project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md).
 Support for Variant is added in Iceberg v3.

Review Comment:
   From previous discussion, the community is interested in both basic Variant 
type support and shredding for better performance. I can see the basic variant 
encoding is settled, although we could add additional types; I think we need to 
finalize the shredding spec so the encoding doesn't change.
   
   Regarding shredding, are you referring to shredded subcolumns from a 
Variant? I'm thinking that we can clarify this in the shredding spec (probably after 
https://github.com/apache/parquet-format/pull/461/files#diff-95f43ac21fdadae78c95da23444ed7a4036a4993e9faa2ee5d8b2c29ef6d8056).
 The top-level variant column has the field ID, and the subcolumns are accessed 
through a path like `location.latitude`.  
    
   > Secondly, the linked document talks about shredding.
   > How does this interact with Iceberg field IDs in the parquet metadata?
   > Do all the columns share field ID, or is only the first column supposed to 
be annotated with the field ID?
   > Let's make it explicit.
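
   To make the path-based access concrete, here is a minimal sketch. It is 
only an illustration under assumptions: the `column` dictionary, its 
`field_id` value, and the `resolve_path` helper are hypothetical and are not 
part of the Iceberg or Parquet APIs, and the shredding spec is not finalized. 
It shows a single field ID on the top-level variant column, with nested values 
reached by name through a dotted path.

```python
from typing import Any, Optional

# Hypothetical in-memory stand-in for a top-level variant column.
# Assumption: only the top-level column carries an Iceberg field ID;
# nested values are addressed by name, not by ID.
column = {
    "field_id": 5,  # illustrative ID, not taken from any real schema
    "name": "event",
    "value": {"location": {"latitude": 37.77, "longitude": -122.42}},
}


def resolve_path(variant_value: Any, path: str) -> Optional[Any]:
    """Walk a dotted path through nested objects; return None if a key is missing."""
    current = variant_value
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current
```

   With this sketch, a reader would resolve the column once by its field ID 
and then call `resolve_path(column["value"], ...)` with the dotted path for 
each subcolumn it needs.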



##########
format/spec.md:
##########
@@ -182,6 +182,21 @@ A **`list`** is a collection of values with some element 
type. The element field
 
 A **`map`** is a collection of key-value pairs with a key type and a value 
type. Both the key field and value field each have an integer id that is unique 
in the table schema. Map keys are required and map values can be either 
optional or required. Both map keys and map values may be any type, including 
nested types.
 
+#### Semi-structured Types
+
+A **`variant`** is a value that stores semi-structured data. The structure and 
data types in a variant are not necessarily consistent across rows in a table 
or data file. The variant type and binary encoding are defined in the [Parquet 
project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md).
 Support for Variant is added in Iceberg v3.
+
+Variants are similar to JSON with a wider set of primitive values including 
date, timestamp, timestamptz, binary, and floating points.
+
+Variant values may contain nested types:
+1. An array is an ordered collection of variant values.
+2. An object is a collection of fields that are a string key and a variant 
value.
+
+As a semi-structured type, there are important differences between variant and 
Iceberg's other types:
+1. Variant arrays are similar to lists, but may contain any variant value 
rather than a fixed element type.
+2. Variant objects are similar to structs, but may contain variable fields 
identified by name and field values may be any variant value rather than a 
fixed field type.
+3. Variant primitives are narrower than Iceberg's primitive types: time, 
timestamp_ns, timestamptz_ns, uuid, and fixed(L) are not supported.

Review Comment:
   Are you talking about the Variant spec changes in 
https://github.com/apache/parquet-format/pull/461 and 
https://github.com/apache/parquet-format/pull/464? I think we will. 



##########
format/spec.md:
##########
@@ -1208,6 +1224,7 @@ Lists must use the [3-level 
representation](https://github.com/apache/parquet-fo
 | **`struct`**       | `group`                                                 
           |                                             |                      
                                          |
 | **`list`**         | `3-level list`                                          
           | `LIST`                                      | See Parquet docs for 
3-level representation.                   |
 | **`map`**          | `3-level map`                                           
           | `MAP`                                       | See Parquet docs for 
3-level representation.                   |
+| **`variant`**      | `group` with `metadata` and `value` fields. `metadata` 
and `value` must not be assigned field IDs.| `VARIANT`                          
         | See Parquet docs for Variant encoding and Variant shredding 
encoding. |

Review Comment:
   Variant type groups in Parquet are expected to have fixed `value` and 
`metadata` fields, which are read by name. Let me add that. 
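
   As a rough sketch of what reading by name could look like (an assumption 
for illustration only: the list-of-dicts schema representation and the 
`find_variant_fields` helper are hypothetical, not Iceberg or Parquet code), a 
reader might locate the two fields by name and reject groups where either one 
is missing or carries a field ID:

```python
from typing import Dict, List

# Names the variant group is expected to contain, per the table row above.
REQUIRED_NAMES = ("metadata", "value")


def find_variant_fields(fields: List[dict]) -> Dict[str, dict]:
    """Look up `metadata` and `value` by name in a hypothetical group schema.

    Each field is a dict with a "name" key and an optional "field_id" key.
    """
    by_name = {f["name"]: f for f in fields}
    found = {}
    for name in REQUIRED_NAMES:
        if name not in by_name:
            raise ValueError(f"variant group is missing required field {name!r}")
        if by_name[name].get("field_id") is not None:
            raise ValueError(f"{name!r} must not be assigned a field ID")
        found[name] = by_name[name]
    return found
```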



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

