rahil-c opened a new pull request, #18497:
URL: https://github.com/apache/hudi/pull/18497

   ## Summary
   
   - Translate Hudi's VECTOR logical-type metadata (`hudi_type = 
"VECTOR(dim[,elem])"`) into lance-spark's `arrow.fixed-size-list.size` metadata 
key before calling `LanceArrowUtils.toArrowSchema`, so the Lance writer emits a 
native Arrow `FixedSizeList<Float32|Float64, dim>` (Lance's vector column 
encoding) instead of a plain variable-length list.
   - No change to `LanceFileWriter.open(...)` storage options is needed — Lance 
reads the fixed-size intent from the Arrow schema itself.
   - Fails fast with `HoodieNotSupportedException` for non-ArrayType or 
non-Float/Double element types (matches lance-spark's 
`VectorUtils.shouldBeFixedSizeList`).
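
   For reference, a minimal, dependency-free sketch of how a `VECTOR(dim[,elem])` descriptor string can be parsed (the real parsing lives in Hudi's `VectorConversionUtils`, whose exact API is not shown here; the FLOAT default for an omitted `elem` is an assumption for illustration):
   
   ```scala
   // Hypothetical stand-alone sketch: parse a "VECTOR(dim[,elem])" descriptor
   // into (dimension, element type). Defaulting the element type to FLOAT when
   // omitted is an assumption of this sketch, not confirmed Hudi behavior.
   object VectorDescriptor {
     private val Pattern = """VECTOR\((\d+)(?:\s*,\s*(\w+))?\)""".r
   
     def parse(hudiType: String): Option[(Int, String)] = hudiType.trim match {
       case Pattern(dim, elem) => Some((dim.toInt, Option(elem).getOrElse("FLOAT")))
       case _                  => None
     }
   }
   ```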
   
   ## Why
   
   Hudi's VECTOR logical type (RFC-99) already round-trips correctly on the 
Parquet path via the `hoodie.vector.columns` footer metadata + 
FIXED_LEN_BYTE_ARRAY storage. On the newly-added Lance base-file path, however, 
VECTOR columns silently degraded to plain `List<Float>` / `List<Double>` Arrow 
fields — losing the fixed-size semantics that make Lance's vector column 
encoding useful (tight packing, future vector search, etc.).
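   For intuition on the "tight packing" point: with a fixed dimension `dim`, row `i`'s vector is a pure slice of one flat values buffer, so no per-row offsets buffer is needed (unlike a variable-length `List<Float>`). A toy illustration, not Lance's actual buffer code:
   
   ```scala
   // Toy sketch of fixed-size-list packing: all vectors share one declared
   // dimension, so they concatenate into a single flat values buffer and row i
   // is recovered by index arithmetic alone, with no offsets buffer.
   def packFixed(vectors: Seq[Array[Float]], dim: Int): Array[Float] = {
     require(vectors.forall(_.length == dim),
       "every vector must have the declared dimension")
     vectors.flatten.toArray
   }
   
   def sliceRow(values: Array[Float], dim: Int, row: Int): Array[Float] =
     values.slice(row * dim, (row + 1) * dim)
   ```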
   
   Lance-Spark's DDL-level `TBLPROPERTIES ('<col>.arrow.fixed-size-list.size' = 
'128')` knob ultimately just attaches that same `arrow.fixed-size-list.size` 
key to the column's Spark metadata. Since Hudi writes at the file level 
(bypassing Spark DDL), we attach the metadata directly from Hudi's existing 
VECTOR descriptor.
   
   ## Implementation
   
   `HoodieSparkLanceWriter#enrichSparkSchemaForLanceVectors`:
   1. Reuses existing `VectorConversionUtils.detectVectorColumnsFromMetadata` 
to find fields tagged with `hudi_type = VECTOR(...)`.
   2. For each such field, attaches 
`LanceArrowUtils.ARROW_FIXED_SIZE_LIST_SIZE_KEY()` 
(`"arrow.fixed-size-list.size"`) as a Long with the dimension, preserving any 
pre-existing metadata (including `hudi_type`) via 
`MetadataBuilder.withMetadata(...)`.
   3. Validates element type is `FloatType` or `DoubleType`; throws 
`HoodieNotSupportedException` otherwise.
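   
   The three steps above can be sketched in plain Scala. The real code operates on Spark's `StructType`/`MetadataBuilder` and throws `HoodieNotSupportedException`; the `Field` case class, `Map`-based metadata, and `require` here are simplified stand-ins:
   
   ```scala
   // Simplified model of the enrichment pass over the write schema. In the
   // actual writer, metadata is merged via MetadataBuilder.withMetadata(...)
   // and validation throws HoodieNotSupportedException; plain types stand in.
   case class Field(name: String, elementType: String, metadata: Map[String, Any])
   
   val FixedSizeListKey = "arrow.fixed-size-list.size" // lance-spark's metadata key
   
   def enrichForLanceVectors(schema: Seq[Field],
                             vectorDims: Map[String, Int]): Seq[Field] =
     schema.map { f =>
       vectorDims.get(f.name) match {
         case Some(dim) =>
           // Step 3: only Float/Double elements map to Lance's vector encoding.
           require(f.elementType == "float" || f.elementType == "double",
             s"VECTOR column '${f.name}' has unsupported element type ${f.elementType}")
           // Step 2: attach the fixed-size-list key, preserving prior metadata
           // (including hudi_type) by adding to the existing map.
           f.copy(metadata = f.metadata + (FixedSizeListKey -> dim.toLong))
         case None => f // not a VECTOR column; leave untouched
       }
     }
   ```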
   
   Downstream: `LanceArrowUtils.toArrowSchema(...)` then emits 
`FixedSizeList<elem, dim>`, and `LanceArrowWriter.createFieldWriter` 
automatically selects its `FixedSizeListWriter` branch when it sees the 
matching Arrow vector — no other code changes required.
   
   ## Test plan
   
   Added to `TestLanceDataSource` (parameterized across COW + MOR):
   - [x] `testFloatVectorRoundTrip` — 4-dim FLOAT VECTOR
   - [x] `testDoubleVectorRoundTrip` — 4-dim DOUBLE VECTOR
   - [x] `testMultipleVectorColumns` — two vector columns of different element 
types / dims on the same row
   
   Each test opens the written `.lance` file directly via `LanceFileReader` and 
asserts `field.getType()` is `ArrowType.FixedSizeList` with the expected 
`listSize`. This is the regression guard that fails pre-fix and passes 
post-fix.
   
   ```
   mvn -pl hudi-spark-datasource/hudi-spark -Pspark3.5,scala-2.12 \
       -Dtest=TestLanceDataSource -DfailIfNoTests=false surefire:test
   ```
   → `Tests run: 24, Failures: 0, Errors: 0, Skipped: 0` (6 new + 18 existing).
   
   ## Out of scope
   
   - INT8 VECTOR support on Lance (lance-spark's `shouldBeFixedSizeList` 
rejects non-Float/Double; would require upstream Lance work or a separate 
encoding).
   - Round-tripping `hudi_type` metadata on the read-side Spark schema (the 
Lance reader currently returns `ArrayType` without the `VECTOR(...)` 
descriptor). Values are preserved; can be added when a concrete downstream 
caller needs it.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to