XuQianJin-Stars opened a new issue, #2873:
URL: https://github.com/apache/fluss/issues/2873

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/fluss/issues) and 
found nothing similar.
   
   
   ### Motivation
   
   ### Background
   
   Semi-structured data (e.g., JSON) is increasingly common in modern data 
pipelines. Many query engines and storage systems (such as Apache Spark, Apache 
Iceberg, and Apache Paimon) have adopted a **VARIANT** data type to efficiently 
represent and query semi-structured data using a compact binary encoding, 
rather than storing raw JSON strings.
   
   Currently, Fluss treats VARIANT internally as plain `byte[]`, which has 
several limitations:
   
   1. **Loss of semantic structure**: A single `byte[]` conflates the variant's 
value and metadata (string dictionary) into one opaque blob. Downstream 
consumers must know the internal wire format (`[4-byte value length][value 
bytes][metadata bytes]`) to decode it correctly.
   2. **Inconsistent API**: All other complex types in Fluss (e.g., 
`InternalArray`, `InternalMap`, `InternalRow`) have dedicated first-class types 
in the row infrastructure, while VARIANT does not.
   3. **Poor interoperability with lake formats**: When writing to lake formats 
(Paimon, Iceberg, Lance), the VARIANT data must be split into separate `value` 
and `metadata` components. Using `byte[]` forces every integration point to 
re-implement the split/merge logic.
   4. **No alignment with industry standards**: Apache Paimon has already 
introduced a full `Variant` interface with `value()` and `metadata()` 
accessors, following the [Variant Binary Encoding 
spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md). 
Fluss should align with this design for ecosystem consistency.
   
   ### Use Case
   
   - Users ingesting JSON or semi-structured data into Fluss tables should 
benefit from efficient binary encoding and per-path access without full 
deserialization.
   - Lake connector writers (Paimon, Iceberg, Lance) need structured access to 
`value` and `metadata` separately.
   - A first-class `Variant` type enables future optimizations like predicate 
pushdown on variant paths.
   
   
   ### Solution
   
   ### Proposed Design
   
   Introduce a first-class `Variant` interface and `GenericVariant` 
implementation throughout Fluss's row infrastructure, following the same 
pattern as Apache Paimon's Variant design.
   
   #### 1. Core Types
   
   - **`Variant` interface** (`fluss-common/.../row/Variant.java`)
     - `byte[] value()` — returns the binary-encoded variant value (header + 
data)
     - `byte[] metadata()` — returns the string dictionary (version + 
deduplicated object key names)
     - `long sizeInBytes()` — total byte size
     - `Variant copy()` — deep copy
     - Static helpers: `bytesToVariant(byte[])` and `variantToBytes(Variant)` 
for backward-compatible wire format conversion
   
   - **`GenericVariant` class** (`fluss-common/.../row/GenericVariant.java`)
     - Implements `Variant` and `Serializable`
     - Stores two `byte[]` fields: `value` and `metadata`
     - Proper `equals()`, `hashCode()`, `toString()`
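
   The two core types listed above can be sketched roughly as follows. This is 
an illustrative sketch, not the actual Fluss code: the static helpers are 
omitted, the constructor shape is an assumption, and `sizeInBytes()` is assumed 
to be the sum of the two array lengths, since the proposal only says "total 
byte size".

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Objects;

// Sketch of the proposed interface (static wire-format helpers omitted).
interface Variant {
    byte[] value();      // binary-encoded variant value (header + data)
    byte[] metadata();   // string dictionary (version + deduplicated key names)
    long sizeInBytes();  // total byte size
    Variant copy();      // deep copy
}

// Sketch of the proposed implementation holding the two byte[] fields.
final class GenericVariant implements Variant, Serializable {
    private static final long serialVersionUID = 1L;

    private final byte[] value;
    private final byte[] metadata;

    // ASSUMPTION: a two-argument constructor; the proposal does not specify one.
    GenericVariant(byte[] value, byte[] metadata) {
        this.value = Objects.requireNonNull(value, "value");
        this.metadata = Objects.requireNonNull(metadata, "metadata");
    }

    @Override
    public byte[] value() {
        return value;
    }

    @Override
    public byte[] metadata() {
        return metadata;
    }

    // ASSUMPTION: "total byte size" means value length plus metadata length.
    @Override
    public long sizeInBytes() {
        return (long) value.length + metadata.length;
    }

    // Deep copy: both arrays are cloned so the copy shares no state.
    @Override
    public Variant copy() {
        return new GenericVariant(value.clone(), metadata.clone());
    }

    // Arrays must be compared by content, not by reference.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof GenericVariant)) {
            return false;
        }
        GenericVariant that = (GenericVariant) o;
        return Arrays.equals(value, that.value)
                && Arrays.equals(metadata, that.metadata);
    }

    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(value) + Arrays.hashCode(metadata);
    }

    @Override
    public String toString() {
        return "GenericVariant{value=" + Arrays.toString(value)
                + ", metadata=" + Arrays.toString(metadata) + "}";
    }
}
```

   Content-based `equals()`/`hashCode()` matter here because two `byte[]` 
fields would otherwise compare by reference, so a variant and its `copy()` 
would not be equal.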
   
   #### 2. Row Infrastructure Changes
   
   | Layer | Change |
   |-------|--------|
   | **DataGetters** | Add `Variant getVariant(int pos)` |
   | **BinaryWriter** | Add `writeVariant(int pos, Variant value)` |
   | **All InternalRow implementations** | Implement `getVariant()` — 
`GenericRow`, `BinaryRow`, `CompactedRow`, `IndexedRow`, `ProjectedRow`, 
`PaddingRow`, `ColumnarRow`, etc. |
   | **All InternalArray implementations** | Implement `getVariant()` — 
`GenericArray`, `BinaryArray`, `ColumnarArray` |
   | **Readers/Writers** | `CompactedRowReader/Writer`, 
`IndexedRowReader/Writer` — add `readVariant()`/`writeVariant(Variant)` |
   
   #### 3. Binary Storage Format (Backward Compatible)
   
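   The on-wire format remains unchanged for compatibility: `[4-byte value 
length][value bytes][metadata bytes]`. `Variant.variantToBytes()` and 
`Variant.bytesToVariant()` convert between this single-blob layout and the 
two-field `Variant`.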
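
   The conversion performed by the two helpers could look roughly like the 
sketch below, written against the layout `[4-byte value length][value 
bytes][metadata bytes]` given in the Motivation section. `SplitVariant` and 
`VariantWireFormat` are hypothetical stand-in names, and the little-endian byte 
order of the length prefix is an assumption; the issue does not state which 
byte order Fluss uses.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for the proposed Variant: just the two components.
final class SplitVariant {
    final byte[] value;
    final byte[] metadata;

    SplitVariant(byte[] value, byte[] metadata) {
        this.value = value;
        this.metadata = metadata;
    }
}

// Hypothetical helper mirroring variantToBytes / bytesToVariant.
final class VariantWireFormat {
    // ASSUMPTION: little-endian length prefix; not stated in the issue.
    private static final ByteOrder LENGTH_ORDER = ByteOrder.LITTLE_ENDIAN;
    private static final int LENGTH_PREFIX_BYTES = 4;

    private VariantWireFormat() {}

    // Merge: [4-byte value length][value bytes][metadata bytes]
    static byte[] toBytes(SplitVariant v) {
        ByteBuffer buf = ByteBuffer
                .allocate(LENGTH_PREFIX_BYTES + v.value.length + v.metadata.length)
                .order(LENGTH_ORDER);
        buf.putInt(v.value.length);
        buf.put(v.value);
        buf.put(v.metadata);
        return buf.array();
    }

    // Split: read the prefix, take that many value bytes, the rest is metadata.
    static SplitVariant fromBytes(byte[] bytes) {
        if (bytes.length < LENGTH_PREFIX_BYTES) {
            throw new IllegalArgumentException("missing 4-byte length prefix");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(LENGTH_ORDER);
        int valueLength = buf.getInt();
        if (valueLength < 0 || valueLength > buf.remaining()) {
            throw new IllegalArgumentException("invalid value length: " + valueLength);
        }
        byte[] value = new byte[valueLength];
        buf.get(value);
        byte[] metadata = new byte[buf.remaining()];
        buf.get(metadata);
        return new SplitVariant(value, metadata);
    }
}
```

   Keeping this logic in one place is what lets the lake connectors read 
`value` and `metadata` directly instead of each re-implementing the split.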
   
   #### 4. Integration Points
   
   - **Lake connectors** (Paimon, Iceberg, Lance): Encoders/decoders use 
`Variant` directly instead of raw `byte[]`
   - **Flink bridge**: `FlussRowToFlinkRowConverter` converts `Variant` → 
`byte[]` for Flink compatibility
   - **Client converters**: `PojoToRowConverter` / `RowToPojoConverter` support 
both `byte[]` and `Variant` inputs
   - **Utilities**: `InternalRowUtils`, `TypeUtils`, `PartitionUtils` updated 
accordingly
   
   #### 5. References
   
   - [Variant Binary Encoding Spec 
(Parquet)](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
   - [Apache Paimon Variant 
Implementation](https://github.com/apache/paimon/tree/master/paimon-common/src/main/java/org/apache/paimon/data/variant)
   - [Apache Spark VARIANT JIRA 
issue](https://issues.apache.org/jira/browse/SPARK-45891)
   
   
   ### Anything else?
   
   _No response_
   
   ### Willingness to contribute
   
   - [x] I'm willing to submit a PR!

