yihua commented on code in PR #18274:
URL: https://github.com/apache/hudi/pull/18274#discussion_r3036403407


##########
rfc/rfc-99/rfc-99.md:
##########
@@ -194,5 +195,444 @@ SQL Extensions needs to be added to define the table in a hudi type native way.
 
 TODO: There is an open question regarding the need to maintain type ids to track schema evolution and how it would interplay with NBCC. 
 
+---
+
+## Variant Type Implementation
+
+This section documents the implementation of the VARIANT type in Hudi, which provides first-class support for
+semi-structured data (e.g., JSON). The Variant type is implemented following Spark 4.0's native VariantType
+specification.
+
+### Overview
+
+The Variant type enables Hudi to store and query semi-structured data efficiently. It is particularly useful for:
+
+- Schema-on-read flexibility for evolving data structures
+- Storing JSON-like data without requiring predefined schemas
+
+### Motivation
+
+Hudi readers and writers should be able to handle datasets with variant types.
+This will allow users to work with semi-structured data more easily.
+
+The variant type is now formally defined in Parquet and engines like Spark have full support for this type.
+Users with semi-structured data are otherwise forced to use strings or byte arrays to store this data.
+
+### What is the VARIANT Type?
+
+The `VARIANT` type is a new data type designed to store semi-structured data (like JSON) efficiently.
+Unlike storing JSON as a plain string, `VARIANT` uses an optimized binary encoding that allows for fast navigation
+and element extraction without needing to parse the entire document.
+It offers the flexibility of a schema-less design (like JSON) with performance closer to structured columns.
+
+### Storage Modes: Shredded vs. Unshredded
+
+- **Unshredded** (Binary Blob):
+    - The entire JSON structure is encoded into binary metadata and value blobs.
+    - **Pros**: Fast write speed; handles completely dynamic/random schemas easily.
+    - **Cons**: To read a single field (e.g., `user.id`), the engine must load the entire binary blob.
+- **Shredded** (Columnar Optimization):
+    - The engine identifies common paths in the data (e.g., `v.a` or `v.c`) and extracts them into separate, native
+      Parquet columns (e.g., Int32, Decimal).
+    - **Pros**: Massive performance gain for queries. If you query `SELECT v:a`, the engine reads only the specific
+      Int32 column and skips the rest of the binary data (Columnar Pruning).
+    - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variation 
+    - in shredding output.

Review Comment:
   <a href="#"><img alt="P1" src="https://greptile-static-assets.s3.amazonaws.com/badges/p1.svg?v=7" align="top"></a> **Broken list item splits the "Cons" sentence**
   
   Line 240 (`    - in shredding output.`) will render as an independent bullet point rather than as a continuation of the sentence on line 239. In Markdown, a list item that breaks across lines needs the continuation to be indented without a leading `- ` dash. As written, readers will see two disconnected bullets: one truncated sentence and one orphan fragment.
   
   ```suggestion
       - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variations
         in shredding output.
   ```
   
   — *Greptile* ([original](https://github.com/yihua/hudi/pull/18#discussion_r3036402592)) (source:comment#3036402592)
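
The "new bullet vs. continuation" distinction this comment relies on can be sketched with a toy line classifier (a rough approximation of CommonMark list handling for illustration only, not GitHub's actual renderer):

```python
import re

def classify(line: str) -> str:
    """Toy rule: a line whose content starts with '- ' opens a new bullet;
    an indented line without the dash continues the previous item."""
    stripped = line.lstrip()
    if re.match(r"-\s", stripped):
        return "new bullet"
    return "continuation"

# The pair flagged above: the second line keeps its '- ' dash,
# so it renders as a separate, orphaned bullet.
broken = [
    "    - **Cons**: ... prone to read jitter if there are large variations",
    "    - in shredding output.",
]
# The suggested fix: an indented continuation without a dash.
fixed = [
    "    - **Cons**: ... prone to read jitter if there are large variations",
    "      in shredding output.",
]
print([classify(l) for l in broken])  # ['new bullet', 'new bullet']
print([classify(l) for l in fixed])   # ['new bullet', 'continuation']
```

This is exactly why the committable suggestion drops the second `- ` and only indents the continuation.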



##########
rfc/rfc-99/variant-appendix.md:
##########
@@ -0,0 +1,119 @@
+### Spark 4.0 Write & Shredding Verification

Review Comment:
   <a href="#"><img alt="P2" src="https://greptile-static-assets.s3.amazonaws.com/badges/p2.svg?v=7" align="top"></a> **Missing top-level document heading**
   
   `variant-appendix.md` opens directly with a `###` (H3) heading, which means it has no document title when rendered standalone. `vector-appendix.md` (the pre-existing appendix this file is modelled after) should be checked for consistency, but a new standalone appendix file should start with at least an H1 or H2 heading so it is self-describing when viewed on GitHub or in documentation tooling.
   
   ```suggestion
   ## Variant Type – Appendix
   
   ### Spark 4.0 Write & Shredding Verification
   ```
   
   — *Greptile* ([original](https://github.com/yihua/hudi/pull/18#discussion_r3036402598)) (source:comment#3036402598)



##########
rfc/rfc-99/variant-appendix.md:
##########
@@ -0,0 +1,119 @@
+### Spark 4.0 Write & Shredding Verification
+
+Spark 4.0 successfully writes VariantType in both Shredded and Unshredded modes.
+
+Physical Layout confirmed via `parquet-tools`: **Shredding** physically maps variant fields to native Parquet columns
+(e.g., `v.typed_value.a.typed_value` as INT32) alongside the standard `metadata` and `value` binary columns.
+
+#### Shredding Configuration
+
+The relevant Spark configuration for shredding is:
+
+```
+spark.sql.variant.writeShredding.enabled=true
+```
+
+You can also force a specific shredding schema for testing:
+
+```
+spark.sql.variant.forceShreddingSchemaForTest=<schema>
+```
+
+#### Shredded Parquet Physical Layout
+
+An example shredded variant column `v` with schema `a int, b string, c decimal(15, 1)` produces 8 physical columns:
+
+```
+v.metadata                       BYTE_ARRAY
+v.value                          BYTE_ARRAY
+v.typed_value.a.value            BYTE_ARRAY
+v.typed_value.a.typed_value      INT32
+v.typed_value.b.value            BYTE_ARRAY
+v.typed_value.b.typed_value      BYTE_ARRAY (String/UTF8)
+v.typed_value.c.value            BYTE_ARRAY
+v.typed_value.c.typed_value      INT64 (Decimal(precision=15, scale=1))
+```
+
+#### Unshredded Parquet Physical Layout
+
+An unshredded variant column `v` produces only 2 physical columns:
+
+```
+v.value      BYTE_ARRAY
+v.metadata   BYTE_ARRAY
+```
+
+### Root-Level Shredding
+
+Variants can also be shredded at the root level (i.e., when the variant value is a scalar, not an object).
+
+Example: forcing `bigint` as the shredding schema with values like `100`, `"hello_world"`, and `{"A": 1, "c": 1.23}`
+will shred the root-level integer into a typed column while falling back to the binary blob for non-matching types.
+
+```python
+spark.conf.set("spark.sql.variant.forceShreddingSchemaForTest", "bigint")
+df = spark.sql("""
+    select case
+        when id = 0 then parse_json('100')
+        when id = 1 then parse_json('"hello_world"')
+        when id = 2 then parse_json('{"A": 1, "c": 1.23}')
+    end as v
+    from range(3)
+""")
+```
+
+### Backward Compatibility (Spark 3.5 Reading VARIANT written by Spark 4.0)
+
+1. **Logical Type Failure**: Spark 3.5 cannot interpret the `Variant` logical type from the Parquet footer, triggering
+   warnings and ignoring the serialized Spark schema:
+    ```
+    WARN ParquetFileFormat: Failed to parse and ignored serialized Spark schema in Parquet key-value metadata:
+    {"type":"struct","fields":[{"name":"v","type":"variant","nullable":true,"metadata":{}}]}
+    ```
+
+2. **Physical Read Success (Raw Binary)**: The data is still readable, but only as raw physical columns:
+    - **Unshredded**: Reads as `Struct<value: Binary, metadata: Binary>`
+    - **Shredded**: Reads as a complex `Struct` containing the binaries plus the nested typed columns:
+      ```
+      root
+       |-- v: struct (nullable = true)
+       |    |-- metadata: binary (nullable = true)
+       |    |-- value: binary (nullable = true)
+       |    |-- typed_value: struct (nullable = true)
+       |    |    |-- a: struct (nullable = true)
+       |    |    |    |-- value: binary (nullable = true)
+       |    |    |    |-- typed_value: integer (nullable = true)
+       |    |    |-- b: struct (nullable = true)
+       |    |    |    |-- value: binary (nullable = true)
+       |    |    |    |-- typed_value: string (nullable = true)
+       |    |    |-- c: struct (nullable = true)
+       |    |    |    |-- value: binary (nullable = true)
+       |    |    |    |-- typed_value: decimal(15,1) (nullable = true)
+      ```
+
+3. **The Consequence**: Users on older engine versions can access the bytes, but they lose all JSON/Variant
+   functionality.
+   They would need to manually parse the internal binary encoding to make sense of the data.
+
+### Shredding Read Compatibility Issue
+
+Writing with shredding enabled and then reading with the shredded flag disabled (
+`spark.sql.variant.allowReadingShredded=false`)
+causes a read failure:
+
+```
+[INVALID_VARIANT_FROM_PARQUET.WRONG_NUM_FIELDS]
+Invalid variant. Variant column must contain exactly two fields.
+```
+
+This occurs because the Parquet schema converter expects exactly two fields (`metadata` and `value`) for a variant
+column,
+but the shredded layout adds additional `typed_value` fields.
+This means engines that do not support shredded variants cannot read shredded parquet files as variant types.
+
+### Appendix
+
+- Source issue: https://github.com/apache/hudi/issues/17744
+- Parquet testing reference: https://github.com/apache/parquet-testing/issues/75#issue-2976445672
+- Spark shredding config
+  source: https://github.com/apache/spark/blob/b615f34aebd760a8b3f67d35e13dca2e78fa5765/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L5678-L5684

Review Comment:
   <a href="#"><img alt="P2" src="https://greptile-static-assets.s3.amazonaws.com/badges/p2.svg?v=7" align="top"></a> **Missing newline at end of file**
   
   The file is missing a trailing newline (flagged by the git diff with `\ No newline at end of file`). This can cause issues with some tools and linters, and is inconsistent with POSIX file conventions.
   
   ```suggestion
     source: https://github.com/apache/spark/blob/b615f34aebd760a8b3f67d35e13dca2e78fa5765/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L5678-L5684
   ```
   
   — *Greptile* ([original](https://github.com/yihua/hudi/pull/18#discussion_r3036402605)) (source:comment#3036402605)
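
The 8-column shredded layout listed in the appendix follows a simple pattern: the variant keeps its `metadata`/`value` binaries, and each shredded field contributes a residual `value` blob plus a native `typed_value` column. A toy sketch of that naming scheme (an illustration of the layout described above, not output from `parquet-tools` or any Parquet API):

```python
def shredded_physical_columns(variant_col, fields):
    """Derive the physical column paths for a shredded variant column.
    `fields` is a list of (name, physical_type) pairs for the shredding schema."""
    cols = [
        (f"{variant_col}.metadata", "BYTE_ARRAY"),
        (f"{variant_col}.value", "BYTE_ARRAY"),
    ]
    for name, physical_type in fields:
        base = f"{variant_col}.typed_value.{name}"
        cols.append((f"{base}.value", "BYTE_ARRAY"))         # per-field fallback blob
        cols.append((f"{base}.typed_value", physical_type))  # native shredded column
    return cols

# Schema `a int, b string, c decimal(15, 1)` -> 2 + 3*2 = 8 physical columns.
layout = shredded_physical_columns(
    "v", [("a", "INT32"), ("b", "BYTE_ARRAY"), ("c", "INT64")]
)
for path, ptype in layout:
    print(f"{path:32} {ptype}")
```

The unshredded case is just the first two entries, which is why a reader that expects exactly two fields chokes on the shredded layout.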



##########
rfc/rfc-99/rfc-99.md:
##########
@@ -194,5 +195,444 @@ SQL Extensions needs to be added to define the table in a hudi type native way.
 
 TODO: There is an open question regarding the need to maintain type ids to track schema evolution and how it would interplay with NBCC. 
 
+---
+
+## Variant Type Implementation
+
+This section documents the implementation of the VARIANT type in Hudi, which provides first-class support for
+semi-structured data (e.g., JSON). The Variant type is implemented following Spark 4.0's native VariantType
+specification.
+
+### Overview
+
+The Variant type enables Hudi to store and query semi-structured data efficiently. It is particularly useful for:
+
+- Schema-on-read flexibility for evolving data structures
+- Storing JSON-like data without requiring predefined schemas
+
+### Motivation
+
+Hudi readers and writers should be able to handle datasets with variant types.
+This will allow users to work with semi-structured data more easily.
+
+The variant type is now formally defined in Parquet and engines like Spark have full support for this type.
+Users with semi-structured data are otherwise forced to use strings or byte arrays to store this data.
+
+### What is the VARIANT Type?
+
+The `VARIANT` type is a new data type designed to store semi-structured data (like JSON) efficiently.
+Unlike storing JSON as a plain string, `VARIANT` uses an optimized binary encoding that allows for fast navigation
+and element extraction without needing to parse the entire document.
+It offers the flexibility of a schema-less design (like JSON) with performance closer to structured columns.
+
+### Storage Modes: Shredded vs. Unshredded
+
+- **Unshredded** (Binary Blob):
+    - The entire JSON structure is encoded into binary metadata and value blobs.
+    - **Pros**: Fast write speed; handles completely dynamic/random schemas easily.
+    - **Cons**: To read a single field (e.g., `user.id`), the engine must load the entire binary blob.
+- **Shredded** (Columnar Optimization):
+    - The engine identifies common paths in the data (e.g., `v.a` or `v.c`) and extracts them into separate, native
+      Parquet columns (e.g., Int32, Decimal).
+    - **Pros**: Massive performance gain for queries. If you query `SELECT v:a`, the engine reads only the specific
+      Int32 column and skips the rest of the binary data (Columnar Pruning).
+    - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variation 
+    - in shredding output.

Review Comment:
   _⚠️ Potential issue_ | _🟡 Minor_
   
   **Fix grammatical error.**
   
   Line 239 contains a grammatical error: "large variation" should be "large variations" (plural).
   
   <details>
   <summary>📝 Proposed fix</summary>
   
   ```diff
   -    - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variation 
   -    - in shredding output.
   +    - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variations
   +      in shredding output.
   ```
   </details>
   
   <!-- suggestion_start -->
   
   <details>
   <summary>📝 Committable suggestion</summary>
   
   > ‼️ **IMPORTANT**
   > Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
   
   ```suggestion
       - **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variations
         in shredding output.
   ```
   
   </details>
   
   <!-- suggestion_end -->
   
   <details>
   <summary>🤖 Prompt for AI Agents</summary>
   
   ```
   Verify each finding against the current code and only fix it if needed.
   
   In `@rfc/rfc-99/rfc-99.md` around lines 239 - 240, the bullet under "- **Cons**: Higher write overhead to analyze and split the data, prone to read jitter if there are large variation" contains a grammatical error: change the phrase "large variation" to "large variations" so the line reads "...prone to read jitter if there are large variations in shredding output." Update the sentence in rfc-99.md where that exact bullet text appears.
   ```
   
   </details>
   
   <!-- fingerprinting:phantom:triton:puma:95b5fe94-0d6d-4eb5-a363-fc8e5d41a9b6 -->
   
   <!-- This is an auto-generated comment by CodeRabbit -->
   
   — *CodeRabbit* ([original](https://github.com/yihua/hudi/pull/18#discussion_r3036403284)) (source:comment#3036403284)
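
The "read jitter from variation in shredding output" trade-off discussed in this bullet comes from the fallback behavior the appendix describes for root-level shredding: values matching the shredding schema land in a typed column, everything else falls back to the binary blob. A toy model of that split (JSON bytes stand in for Spark's real variant binary encoding; this is not Spark's implementation):

```python
import json

def shred_root(values, target_type=int):
    """Shred scalar variant values against a single-type shredding schema.
    Matching values fill `typed_value`; mismatches fall back to the `value` blob."""
    rows = []
    for v in values:
        # bool is excluded because it is a subclass of int in Python.
        if isinstance(v, target_type) and not isinstance(v, bool):
            rows.append({"typed_value": v, "value": None})
        else:
            rows.append({"typed_value": None, "value": json.dumps(v).encode()})
    return rows

# Mirrors the appendix example: bigint schema vs. 100, "hello_world", {"A": 1, "c": 1.23}.
rows = shred_root([100, "hello_world", {"A": 1, "c": 1.23}])
print([r["typed_value"] for r in rows])  # [100, None, None]
```

The more the incoming data straddles this match/fallback boundary, the more the per-row read cost varies, which is the jitter the corrected sentence refers to.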



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
