rahil-c opened a new pull request, #18497:
URL: https://github.com/apache/hudi/pull/18497
## Summary
- Translate Hudi's VECTOR logical-type metadata (`hudi_type =
"VECTOR(dim[,elem])"`) into lance-spark's `arrow.fixed-size-list.size` metadata
key before calling `LanceArrowUtils.toArrowSchema`, so the Lance writer emits a
native Arrow `FixedSizeList<Float32|Float64, dim>` (Lance's vector column
encoding) instead of a plain variable-length list.
- No change to `LanceFileWriter.open(...)` storage options is needed — Lance
reads the fixed-size intent from the Arrow schema itself.
- Fails fast with `HoodieNotSupportedException` for non-ArrayType columns and
for element types other than Float/Double (matching lance-spark's
`VectorUtils.shouldBeFixedSizeList`).
## Why
Hudi's VECTOR logical type (RFC-99) already round-trips correctly on the
Parquet path via the `hoodie.vector.columns` footer metadata +
FIXED_LEN_BYTE_ARRAY storage. On the newly added Lance base-file path, however,
VECTOR columns silently degraded to plain `List<Float>` / `List<Double>` Arrow
fields — losing the fixed-size semantics that make Lance's vector column
encoding useful (tight packing, future vector search, etc.).
Lance-Spark's DDL-level `TBLPROPERTIES ('<col>.arrow.fixed-size-list.size' =
'128')` knob ultimately just attaches that same `arrow.fixed-size-list.size`
key to the column's Spark metadata. Since Hudi writes at the file level
(bypassing Spark DDL), we attach the metadata directly from Hudi's existing
VECTOR descriptor.
## Implementation
`HoodieSparkLanceWriter#enrichSparkSchemaForLanceVectors`:
1. Reuses existing `VectorConversionUtils.detectVectorColumnsFromMetadata`
to find fields tagged with `hudi_type = VECTOR(...)`.
2. For each such field, attaches
`LanceArrowUtils.ARROW_FIXED_SIZE_LIST_SIZE_KEY()`
(`"arrow.fixed-size-list.size"`) as a Long with the dimension, preserving any
pre-existing metadata (including `hudi_type`) via
`MetadataBuilder.withMetadata(...)`.
3. Validates element type is `FloatType` or `DoubleType`; throws
`HoodieNotSupportedException` otherwise.
Downstream: `LanceArrowUtils.toArrowSchema(...)` then emits
`FixedSizeList<elem, dim>`, and `LanceArrowWriter.createFieldWriter`
automatically selects its `FixedSizeListWriter` branch when it sees the
matching Arrow vector — no other code changes required.
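The enrichment step above can be sketched as follows, modeling a field's Spark metadata as a plain map rather than Spark's `MetadataBuilder` (and using `UnsupportedOperationException` as a stand-in for `HoodieNotSupportedException`). All names here are illustrative, not Hudi's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the metadata enrichment described above, under the
// assumption that field metadata can be modeled as a String -> Object map.
public class VectorMetadataEnricher {
    // Same key lance-spark's LanceArrowUtils.ARROW_FIXED_SIZE_LIST_SIZE_KEY() exposes.
    static final String FIXED_SIZE_LIST_KEY = "arrow.fixed-size-list.size";

    // Returns a copy of the field metadata with the fixed-size-list key attached,
    // preserving pre-existing entries such as hudi_type.
    public static Map<String, Object> enrich(Map<String, Object> metadata,
                                             long dimension,
                                             String elementType) {
        // Fail fast for element types lance-spark cannot encode as a
        // FixedSizeList (mirrors VectorUtils.shouldBeFixedSizeList).
        if (!elementType.equals("float") && !elementType.equals("double")) {
            throw new UnsupportedOperationException(
                "VECTOR element type not supported on Lance: " + elementType);
        }
        Map<String, Object> enriched = new HashMap<>(metadata);
        enriched.put(FIXED_SIZE_LIST_KEY, dimension);
        return enriched;
    }

    public static void main(String[] args) {
        Map<String, Object> meta = new HashMap<>();
        meta.put("hudi_type", "VECTOR(4)");
        Map<String, Object> out = enrich(meta, 4L, "float");
        System.out.println(out.get(FIXED_SIZE_LIST_KEY)); // prints "4"
    }
}
```

The key design point the PR relies on is that only the schema metadata changes; the downstream `toArrowSchema` call reads this key and picks the fixed-size-list Arrow type on its own.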
## Test plan
Added to `TestLanceDataSource` (parameterized across COW + MOR):
- [x] `testFloatVectorRoundTrip` — 4-dim FLOAT VECTOR
- [x] `testDoubleVectorRoundTrip` — 4-dim DOUBLE VECTOR
- [x] `testMultipleVectorColumns` — two vector columns of different element
types / dims on the same row
Each test opens the written `.lance` file directly via `LanceFileReader` and
asserts `field.getType()` is `ArrowType.FixedSizeList` with the expected
`listSize`. This is the regression guard that would fail pre-fix and passes
post-fix.
```
mvn -pl hudi-spark-datasource/hudi-spark -Pspark3.5,scala-2.12 \
-Dtest=TestLanceDataSource -DfailIfNoTests=false surefire:test
```
→ `Tests run: 24, Failures: 0, Errors: 0, Skipped: 0` (6 new + 18 existing).
## Out of scope
- INT8 VECTOR support on Lance (lance-spark's `shouldBeFixedSizeList`
rejects non-Float/Double; would require upstream Lance work or a separate
encoding).
- Round-tripping the `hudi_type` metadata on the read-side Spark schema (the
Lance reader currently returns `ArrayType` without the `VECTOR(...)`
descriptor). Values themselves are preserved; the descriptor can be re-attached
when a concrete downstream caller needs it.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]