yihua commented on code in PR #18274:
URL: https://github.com/apache/hudi/pull/18274#discussion_r3036403407
##########
rfc/rfc-99/rfc-99.md:
##########
@@ -194,5 +195,444 @@ SQL Extensions needs to be added to define the table in a hudi type native way.
 
 TODO: There is an open question regarding the need to maintain type ids to track schema evolution and how it would interplay with NBCC.
 
+---
+
+## Variant Type Implementation
+
+This section documents the implementation of the VARIANT type in Hudi, which provides first-class support for
+semi-structured data (e.g., JSON). The Variant type is implemented following Spark 4.0's native VariantType
+specification.
+
+### Overview
+
+The Variant type enables Hudi to store and query semi-structured data efficiently. It is particularly useful for:
+
+- Schema-on-read flexibility for evolving data structures
+- Storing JSON-like data without requiring predefined schemas
+
+### Motivation
+
+Hudi readers and writers should be able to handle datasets with variant types.
+This will allow users to work with semi-structured data more easily.
+
+The variant type is now formally defined in Parquet and engines like Spark have full support for this type.
+Users with semi-structured data are otherwise forced to use strings or byte arrays to store this data.
+
+### What is the VARIANT Type?
+
+The `VARIANT` type is a new data type designed to store semi-structured data (like JSON) efficiently.
+Unlike storing JSON as a plain string, `VARIANT` uses an optimized binary encoding that allows for fast navigation
+and element extraction without needing to parse the entire document.
+It offers the flexibility of a schema-less design (like JSON) with performance closer to structured columns.
+
+### Storage Modes: Shredded vs. Unshredded
+
+- **Unshredded** (Binary Blob):
+  - The entire JSON structure is encoded into binary metadata and value blobs.
+  - **Pros**: Fast write speed; handles completely dynamic/random schemas easily.
+  - **Cons**: To read a single field (e.g., `user.id`), the engine must load the entire binary blob.
+- **Shredded** (Columnar Optimization):
+  - The engine identifies common paths in the data (e.g., `v.a` or `v.c`) and extracts them into separate, native
+    Parquet columns (e.g., Int32, Decimal).
+  - **Pros**: Massive performance gain for queries. If you query `SELECT v:a`, the engine reads only the specific
+    Int32 column and skips the rest of the binary data (Columnar Pruning).
+  - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variation
+  - in shredding output.

Review Comment:
   [P1] **Broken list item splits the "Cons" sentence**

   Line 240 (` - in shredding output.`) will render as an independent bullet point rather than as a continuation of the sentence on line 239. In Markdown, a list item that breaks across lines needs the continuation to be indented without a leading `- ` dash. As written, readers will see two disconnected bullets: one truncated sentence and one orphan fragment.

   ```suggestion
   - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variations in shredding output.
   ```

   — *Greptile* ([original](https://github.com/yihua/hudi/pull/18#discussion_r3036402592)) (source:comment#3036402592)

##########
rfc/rfc-99/variant-appendix.md:
##########
@@ -0,0 +1,119 @@
+### Spark 4.0 Write & Shredding Verification

Review Comment:
   [P2] **Missing top-level document heading**

   `variant-appendix.md` opens directly with a `###` (H3) heading, which means it has no document title when rendered standalone.
   `vector-appendix.md` (the pre-existing appendix this file is modelled after) should be checked for consistency, but a new standalone appendix file should start with at least an H1 or H2 heading so it is self-describing when viewed on GitHub or in documentation tooling.

   ```suggestion
   ## Variant Type – Appendix

   ### Spark 4.0 Write & Shredding Verification
   ```

   — *Greptile* ([original](https://github.com/yihua/hudi/pull/18#discussion_r3036402598)) (source:comment#3036402598)

##########
rfc/rfc-99/variant-appendix.md:
##########
@@ -0,0 +1,119 @@
+### Spark 4.0 Write & Shredding Verification
+
+Spark 4.0 successfully writes VariantType in both Shredded and Unshredded modes.
+
+Physical Layout confirmed via `parquet-tools`: **Shredding** physically maps variant fields to native Parquet columns
+(e.g., `v.typed_value.a.typed_value` as INT32) alongside the standard `metadata` and `value` binary columns.
+
+#### Shredding Configuration
+
+The relevant Spark configuration for shredding is:
+
+```
+spark.sql.variant.writeShredding.enabled=true
+```
+
+You can also force a specific shredding schema for testing:
+
+```
+spark.sql.variant.forceShreddingSchemaForTest=<schema>
+```
+
+#### Shredded Parquet Physical Layout
+
+An example shredded variant column `v` with schema `a int, b string, c decimal(15, 1)` produces 8 physical columns:
+
+```
+v.metadata                   BYTE_ARRAY
+v.value                      BYTE_ARRAY
+v.typed_value.a.value        BYTE_ARRAY
+v.typed_value.a.typed_value  INT32
+v.typed_value.b.value        BYTE_ARRAY
+v.typed_value.b.typed_value  BYTE_ARRAY (String/UTF8)
+v.typed_value.c.value        BYTE_ARRAY
+v.typed_value.c.typed_value  INT64 (Decimal(precision=15, scale=1))
+```
+
+#### Unshredded Parquet Physical Layout
+
+An unshredded variant column `v` produces only 2 physical columns:
+
+```
+v.value     BYTE_ARRAY
+v.metadata  BYTE_ARRAY
+```
+
+### Root-Level Shredding
+
+Variants can also be shredded at the root level (i.e., when the variant value is a scalar, not an object).
+
+Example: forcing `bigint` as the shredding schema with values like `100`, `"hello_world"`, and `{"A": 1, "c": 1.23}`
+will shred the root-level integer into a typed column while falling back to the binary blob for non-matching types.
+
+```python
+spark.conf.set("spark.sql.variant.forceShreddingSchemaForTest", "bigint")
+df = spark.sql("""
+    select case
+        when id = 0 then parse_json('100')
+        when id = 1 then parse_json('"hello_world"')
+        when id = 2 then parse_json('{"A": 1, "c": 1.23}')
+    end as v
+    from range(3)
+""")
+```
+
+### Backward Compatibility (Spark 3.5 Reading VARIANT written by Spark 4.0)
+
+1. **Logical Type Failure**: Spark 3.5 cannot interpret the `Variant` logical type from the Parquet footer, triggering
+   warnings and ignoring the serialized Spark schema:
+   ```
+   WARN ParquetFileFormat: Failed to parse and ignored serialized Spark schema in Parquet key-value metadata:
+   {"type":"struct","fields":[{"name":"v","type":"variant","nullable":true,"metadata":{}}]}
+   ```
+
+2. **Physical Read Success (Raw Binary)**: The data is still readable, but only as raw physical columns:
+   - **Unshredded**: Reads as `Struct<value: Binary, metadata: Binary>`
+   - **Shredded**: Reads as a complex `Struct` containing the binaries plus the nested typed columns:
+     ```
+     root
+      |-- v: struct (nullable = true)
+      |    |-- metadata: binary (nullable = true)
+      |    |-- value: binary (nullable = true)
+      |    |-- typed_value: struct (nullable = true)
+      |    |    |-- a: struct (nullable = true)
+      |    |    |    |-- value: binary (nullable = true)
+      |    |    |    |-- typed_value: integer (nullable = true)
+      |    |    |-- b: struct (nullable = true)
+      |    |    |    |-- value: binary (nullable = true)
+      |    |    |    |-- typed_value: string (nullable = true)
+      |    |    |-- c: struct (nullable = true)
+      |    |    |    |-- value: binary (nullable = true)
+      |    |    |    |-- typed_value: decimal(15,1) (nullable = true)
+     ```
+
+3. **The Consequence**: Users on older engine versions can access the bytes, but they lose all JSON/Variant
+   functionality.
+   They would need to manually parse the internal binary encoding to make sense of the data.
+
+### Shredding Read Compatibility Issue
+
+Writing with shredding enabled and then reading with the shredded flag disabled (
+`spark.sql.variant.allowReadingShredded=false`)
+causes a read failure:
+
+```
+[INVALID_VARIANT_FROM_PARQUET.WRONG_NUM_FIELDS]
+Invalid variant. Variant column must contain exactly two fields.
+```
+
+This occurs because the Parquet schema converter expects exactly two fields (`metadata` and `value`) for a variant
+column,
+but the shredded layout adds additional `typed_value` fields.
+This means engines that do not support shredded variants cannot read shredded parquet files as variant types.
+
+### Appendix
+
+- Source issue: https://github.com/apache/hudi/issues/17744
+- Parquet testing reference: https://github.com/apache/parquet-testing/issues/75#issue-2976445672
+- Spark shredding config
+  source: https://github.com/apache/spark/blob/b615f34aebd760a8b3f67d35e13dca2e78fa5765/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L5678-L5684

Review Comment:
   [P2] **Missing newline at end of file**

   The file is missing a trailing newline (flagged by the git diff with `\ No newline at end of file`). This can cause issues with some tools and linters, and is inconsistent with POSIX file conventions.

   ```suggestion
     source: https://github.com/apache/spark/blob/b615f34aebd760a8b3f67d35e13dca2e78fa5765/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L5678-L5684
   ```

   — *Greptile* ([original](https://github.com/yihua/hudi/pull/18#discussion_r3036402605)) (source:comment#3036402605)

##########
rfc/rfc-99/rfc-99.md:
##########
@@ -194,5 +195,444 @@ SQL Extensions needs to be added to define the table in a hudi type native way.
 
 TODO: There is an open question regarding the need to maintain type ids to track schema evolution and how it would interplay with NBCC.
 
+---
+
+## Variant Type Implementation
+
+This section documents the implementation of the VARIANT type in Hudi, which provides first-class support for
+semi-structured data (e.g., JSON). The Variant type is implemented following Spark 4.0's native VariantType
+specification.
+
+### Overview
+
+The Variant type enables Hudi to store and query semi-structured data efficiently. It is particularly useful for:
+
+- Schema-on-read flexibility for evolving data structures
+- Storing JSON-like data without requiring predefined schemas
+
+### Motivation
+
+Hudi readers and writers should be able to handle datasets with variant types.
+This will allow users to work with semi-structured data more easily.
+
+The variant type is now formally defined in Parquet and engines like Spark have full support for this type.
+Users with semi-structured data are otherwise forced to use strings or byte arrays to store this data.
+
+### What is the VARIANT Type?
+
+The `VARIANT` type is a new data type designed to store semi-structured data (like JSON) efficiently.
+Unlike storing JSON as a plain string, `VARIANT` uses an optimized binary encoding that allows for fast navigation
+and element extraction without needing to parse the entire document.
+It offers the flexibility of a schema-less design (like JSON) with performance closer to structured columns.
+
+### Storage Modes: Shredded vs. Unshredded
+
+- **Unshredded** (Binary Blob):
+  - The entire JSON structure is encoded into binary metadata and value blobs.
+  - **Pros**: Fast write speed; handles completely dynamic/random schemas easily.
+  - **Cons**: To read a single field (e.g., `user.id`), the engine must load the entire binary blob.
+- **Shredded** (Columnar Optimization):
+  - The engine identifies common paths in the data (e.g., `v.a` or `v.c`) and extracts them into separate, native
+    Parquet columns (e.g., Int32, Decimal).
+  - **Pros**: Massive performance gain for queries. If you query `SELECT v:a`, the engine reads only the specific
+    Int32 column and skips the rest of the binary data (Columnar Pruning).
+  - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variation
+  - in shredding output.

Review Comment:
   _⚠️ Potential issue_ | _🟡 Minor_

   **Fix grammatical error.**

   Line 239 contains a grammatical error: "large variation" should be "large variations" (plural).

   <details>
   <summary>📝 Proposed fix</summary>

   ```diff
   -  - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variation
   -  - in shredding output.
   +  - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variations
   +    in shredding output.
   ```

   </details>

   <details>
   <summary>📝 Committable suggestion</summary>

   > ‼️ **IMPORTANT**
   > Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

   ```suggestion
   - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variations in shredding output.
   ```

   </details>

   <details>
   <summary>🤖 Prompt for AI Agents</summary>

   ```
   Verify each finding against the current code and only fix it if needed.
   In `@rfc/rfc-99/rfc-99.md` around lines 239 - 240, The bullet under
   "- **Cons**: Higher write overhead to analyze and split the data, prone to
   read jitter if there are large variation" contains a grammatical error:
   change the phrase "large variation" to "large variations" so the line reads
   "...prone to read jitter if there are large variations in shredding output."
   Update the sentence in rfc-99.md where that exact bullet text appears.
   ```

   </details>

   — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/18#discussion_r3036403284)) (source:comment#3036403284)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
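The list-continuation problem flagged in the P1/CodeRabbit comments above can be sketched in a few lines of Python. This is a rough heuristic for where Markdown starts a new bullet, not a full CommonMark parser, and `bullet_count` is a made-up helper for illustration only:

```python
import re

def bullet_count(md: str) -> int:
    """Count lines that Markdown treats as the start of a new list item.

    Heuristic: up to three leading spaces followed by "- " starts a bullet;
    deeper indentation is treated as continuation or nested content.
    """
    return sum(1 for line in md.splitlines() if re.match(r" {0,3}- ", line))

# As written in the RFC hunk: the continuation line keeps a leading "- ",
# so it renders as a second, disconnected bullet.
broken = (
    "- **Cons**: prone to read jitter if there are large variations\n"
    "- in shredding output."
)

# The suggested shape: continuation indented (or joined into one line),
# so the sentence stays inside a single list item.
fixed = (
    "- **Cons**: prone to read jitter if there are large variations\n"
    "  in shredding output."
)

assert bullet_count(broken) == 2  # two separate bullets
assert bullet_count(fixed) == 1   # one bullet with a continuation line
```

Either indenting the continuation or joining the two lines (as the committable suggestion does) resolves the rendering issue.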
