Re: [PR] Spec: add variant type [iceberg]

via GitHub Thu, 05 Dec 2024 05:53:54 -0800


findepi commented on code in PR #10831:
URL: https://github.com/apache/iceberg/pull/10831#discussion_r1871412309



##########
format/spec.md:
##########
@@ -182,6 +182,21 @@ A **`list`** is a collection of values with some element type. The element field
 
 A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.
 
+#### Semi-structured Types
+
+A **`variant`** is a value that stores semi-structured data. The structure and data types in a variant are not necessarily consistent across rows in a table or data file. The variant type and binary encoding are defined in the [Parquet project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md). Support for Variant is added in Iceberg v3.

Review Comment:
   From the linked document
   
   > Important
   > 
   > This specification is still under active development, and has not been formally adopted.
   
   Assuming the Parquet-level spec is subject to change, what are the conditions to release Iceberg 3 with variant support in the spec?
   
   Secondly, the linked document talks about shredding.
   How does this interact with Iceberg field IDs in the Parquet metadata?
   Do all the columns share the field ID, or is only the first column supposed to be annotated with the field ID?
   Let's make it explicit.



##########
format/spec.md:
##########
@@ -1208,6 +1224,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo
 | **`struct`**       | `group`                                                            |                                             |                                                                |
 | **`list`**         | `3-level list`                                                     | `LIST`                                      | See Parquet docs for 3-level representation.                   |
 | **`map`**          | `3-level map`                                                      | `MAP`                                       | See Parquet docs for 3-level representation.                   |
+| **`variant`**      | `group` with `metadata` and `value` fields. `metadata` and `value` must not be assigned field IDs.| `VARIANT`                                   | See Parquet docs for Variant encoding and Variant shredding encoding. |

Review Comment:
   If these don't have field IDs, how should a reader locate the contents of a variant Iceberg field?
   Are these mapped by name?
   
   Maybe that's obvious, but let's make it explicit.
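
To make the question concrete, here is a minimal sketch of the "mapped by name" reading. It assumes the variant group itself carries the Iceberg field ID and that its `metadata` and `value` children are then resolved by name. The nested-dict schema model, the `payload` column, and both helper functions are hypothetical illustrations, not a Parquet or Iceberg API, and the spec text under review does not yet confirm this behavior.

```python
from typing import Optional

# Hypothetical model of a Parquet schema as nested dicts; this is not a real
# Parquet or Iceberg API, only an illustration of the by-name reading.


def find_by_field_id(group: dict, field_id: int) -> Optional[dict]:
    """Depth-first search for the schema node annotated with field_id."""
    for child in group.get("children", []):
        if child.get("field_id") == field_id:
            return child
        found = find_by_field_id(child, field_id)
        if found is not None:
            return found
    return None


def variant_columns(schema: dict, field_id: int) -> dict:
    """Resolve the variant group by field ID, then its children by name."""
    group = find_by_field_id(schema, field_id)
    if group is None:
        raise KeyError(f"no column with field id {field_id}")
    by_name = {c["name"]: c for c in group.get("children", [])}
    # Assumption: `metadata` and `value` carry no field IDs, so their names
    # are the only handle a reader has.
    return {"metadata": by_name["metadata"], "value": by_name["value"]}


schema = {
    "name": "table",
    "children": [
        {"name": "id", "field_id": 1, "type": "int64"},
        {
            "name": "payload",  # hypothetical variant column
            "field_id": 2,
            "children": [
                {"name": "metadata", "type": "binary"},
                {"name": "value", "type": "binary"},
            ],
        },
    ],
}

cols = variant_columns(schema, 2)
```

If the intended behavior differs, for example if shredded columns also need IDs, the lookup in `variant_columns` is the step the spec would have to define.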



##########
format/spec.md:
##########
@@ -182,6 +182,21 @@ A **`list`** is a collection of values with some element type. The element field
 
 A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.
 
+#### Semi-structured Types
+
+A **`variant`** is a value that stores semi-structured data. The structure and data types in a variant are not necessarily consistent across rows in a table or data file. The variant type and binary encoding are defined in the [Parquet project](https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/VariantEncoding.md). Support for Variant is added in Iceberg v3.
+
+Variants are similar to JSON with a wider set of primitive values including date, timestamp, timestamptz, binary, and floating points.

Review Comment:
   If this documents the difference from JSON, let's skip floats and add decimals.
   If this documents all types variant supports, let's add integers, decimals, and strings.
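
As a rough illustration of the first option, the sketch below checks which Python values the standard `json` module encodes without a custom encoder. This reflects one encoder's defaults rather than the JSON grammar itself, and the sample values are arbitrary; it is only meant to show why floating points do not distinguish variant from JSON while decimals, dates, timestamps, and binary do.

```python
import datetime
import decimal
import json


def json_native(value) -> bool:
    """Return True if Python's default JSON encoder accepts the value as-is."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


# Arbitrary sample values, one per kind of primitive mentioned in the thread.
samples = {
    "integer": 42,
    "floating point": 1.5,
    "string": "abc",
    "decimal": decimal.Decimal("1.50"),
    "date": datetime.date(2024, 12, 5),
    "timestamp": datetime.datetime(2024, 12, 5, 5, 53, 54),
    "binary": b"\x00\x01",
}

native = {name: json_native(value) for name, value in samples.items()}
```

Here `native` maps integers, floating points, and strings to `True` and the remaining kinds to `False`, which matches the suggestion to list decimals rather than floats as an extension over JSON.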



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


