XuQianJin-Stars opened a new issue, #2873: URL: https://github.com/apache/fluss/issues/2873
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/fluss/issues) and found nothing similar.

### Motivation

### Background

Semi-structured data (e.g., JSON) is increasingly common in modern data pipelines. Many query engines and storage systems (such as Apache Spark, Apache Iceberg, and Apache Paimon) have adopted a **VARIANT** data type to efficiently represent and query semi-structured data using a compact binary encoding, rather than storing raw JSON strings.

Currently, Fluss treats VARIANT internally as plain `byte[]`, which has several limitations:

1. **Loss of semantic structure**: A single `byte[]` conflates the variant's value and metadata (string dictionary) into one opaque blob. Downstream consumers must know the internal wire format (`[4-byte value length][value bytes][metadata bytes]`) to decode it correctly.
2. **Inconsistent API**: All other complex types in Fluss (e.g., `InternalArray`, `InternalMap`, `InternalRow`) have dedicated first-class types in the row infrastructure, while VARIANT does not.
3. **Poor interoperability with lake formats**: When writing to lake formats (Paimon, Iceberg, Lance), the VARIANT data must be split into separate `value` and `metadata` components. Using `byte[]` forces every integration point to re-implement the split/merge logic.
4. **No alignment with industry standards**: Apache Paimon has already introduced a full `Variant` interface with `value()` and `metadata()` accessors, following the [Variant Binary Encoding spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md). Fluss should align with this design for ecosystem consistency.

### Use Case

- Users ingesting JSON or semi-structured data into Fluss tables should benefit from efficient binary encoding and per-path access without full deserialization.
- Lake connector writers (Paimon, Iceberg, Lance) need structured access to `value` and `metadata` separately.
- A first-class `Variant` type enables future optimizations like predicate pushdown on variant paths.

### Solution

### Proposed Design

Introduce a first-class `Variant` interface and `GenericVariant` implementation throughout Fluss's row infrastructure, following the same pattern as Apache Paimon's Variant design.

#### 1. Core Types

- **`Variant` interface** (`fluss-common/.../row/Variant.java`)
  - `byte[] value()`: returns the binary-encoded variant value (header + data)
  - `byte[] metadata()`: returns the string dictionary (version + deduplicated object key names)
  - `long sizeInBytes()`: total byte size
  - `Variant copy()`: deep copy
  - Static helpers: `bytesToVariant(byte[])` and `variantToBytes(Variant)` for backward-compatible wire format conversion
- **`GenericVariant` class** (`fluss-common/.../row/GenericVariant.java`)
  - Implements `Variant` and `Serializable`
  - Stores two `byte[]` fields: `value` and `metadata`
  - Proper `equals()`, `hashCode()`, `toString()`

#### 2. Row Infrastructure Changes

| Layer | Change |
|-------|--------|
| **DataGetters** | Add `Variant getVariant(int pos)` |
| **BinaryWriter** | Add `writeVariant(int pos, Variant value)` |
| **All InternalRow implementations** | Implement `getVariant()` in `GenericRow`, `BinaryRow`, `CompactedRow`, `IndexedRow`, `ProjectedRow`, `PaddingRow`, `ColumnarRow`, etc. |
| **All InternalArray implementations** | Implement `getVariant()` in `GenericArray`, `BinaryArray`, `ColumnarArray` |
| **Readers/Writers** | Add `readVariant()`/`writeVariant(Variant)` to `CompactedRowReader/Writer` and `IndexedRowReader/Writer` |

#### 3. Binary Storage Format (Backward Compatible)

The on-wire format (`[4-byte value length][value bytes][metadata bytes]`) remains unchanged for compatibility: `Variant.variantToBytes()` and `Variant.bytesToVariant()` handle the conversion.

#### 4. Integration Points

- **Lake connectors** (Paimon, Iceberg, Lance): Encoders/decoders use `Variant` directly instead of raw `byte[]`
- **Flink bridge**: `FlussRowToFlinkRowConverter` converts `Variant` → `byte[]` for Flink compatibility
- **Client converters**: `PojoToRowConverter` / `RowToPojoConverter` support both `byte[]` and `Variant` inputs
- **Utilities**: `InternalRowUtils`, `TypeUtils`, `PartitionUtils` updated accordingly

#### 5. References

- [Variant Binary Encoding Spec (Parquet)](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
- [Apache Paimon Variant Implementation](https://github.com/apache/paimon/tree/master/paimon-common/src/main/java/org/apache/paimon/data/variant)
- [Apache Spark VARIANT JIRA issue (SPARK-45891)](https://issues.apache.org/jira/browse/SPARK-45891)

### Anything else?

_No response_

### Willingness to contribute

- [x] I'm willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
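
For illustration only, the Core Types and the backward-compatible wire format described in the proposal could be sketched roughly as follows. This is a minimal sketch, not the Fluss implementation: the method names come from the issue text, but the big-endian length prefix, the definition of `sizeInBytes()` as value length plus metadata length, and the `GenericVariant` constructor shape are assumptions made for this example.

```java
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Sketch of the proposed interface; method names follow the issue text. */
interface Variant {
    byte[] value();
    byte[] metadata();
    long sizeInBytes();
    Variant copy();

    /** Packs a variant as [4-byte value length][value bytes][metadata bytes]. */
    static byte[] variantToBytes(Variant v) {
        byte[] value = v.value();
        byte[] metadata = v.metadata();
        // ASSUMPTION: big-endian length prefix (ByteBuffer's default byte order).
        return ByteBuffer.allocate(4 + value.length + metadata.length)
                .putInt(value.length)
                .put(value)
                .put(metadata)
                .array();
    }

    /** Splits the packed form back into its value and metadata parts. */
    static Variant bytesToVariant(byte[] bytes) {
        if (bytes.length < 4) {
            throw new IllegalArgumentException("Variant bytes shorter than the length prefix");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int valueLen = buf.getInt();
        if (valueLen < 0 || valueLen > buf.remaining()) {
            throw new IllegalArgumentException("Corrupt variant: value length " + valueLen);
        }
        byte[] value = new byte[valueLen];
        buf.get(value);
        // Everything after the value bytes is treated as metadata.
        byte[] metadata = new byte[buf.remaining()];
        buf.get(metadata);
        return new GenericVariant(value, metadata);
    }
}

/** Sketch of the proposed two-field implementation. */
final class GenericVariant implements Variant, Serializable {
    private static final long serialVersionUID = 1L;
    private final byte[] value;
    private final byte[] metadata;

    // ASSUMPTION: a plain two-argument constructor; the issue does not specify one.
    GenericVariant(byte[] value, byte[] metadata) {
        this.value = value;
        this.metadata = metadata;
    }

    @Override public byte[] value() { return value; }
    @Override public byte[] metadata() { return metadata; }

    // ASSUMPTION: "total byte size" means value length plus metadata length.
    @Override public long sizeInBytes() { return (long) value.length + metadata.length; }

    // Deep copy: both arrays are cloned so the copy shares no state.
    @Override public Variant copy() { return new GenericVariant(value.clone(), metadata.clone()); }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GenericVariant)) return false;
        GenericVariant that = (GenericVariant) o;
        return Arrays.equals(value, that.value) && Arrays.equals(metadata, that.metadata);
    }

    @Override public int hashCode() {
        return 31 * Arrays.hashCode(value) + Arrays.hashCode(metadata);
    }

    @Override public String toString() {
        return "GenericVariant{value=" + Arrays.toString(value)
                + ", metadata=" + Arrays.toString(metadata) + "}";
    }
}
```

With helpers along these lines, a lake connector could call `Variant.bytesToVariant(bytes)` once and then read `value()` and `metadata()` separately, instead of re-implementing the split at each integration point.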
