Re: [PR] feat(lance): Ensure Spark-SQL works correctly when using lance as base file format [hudi]

via GitHub Sun, 12 Apr 2026 10:08:46 -0700


rahil-c commented on PR #18375:
URL: https://github.com/apache/hudi/pull/18375#issuecomment-4232006077


   **[High] Recommend splitting ExpressionPayload merge fixes into a separate 
PR**
   
   The changes to `UpdateProcessor.handleNonDeletes`, 
`HoodieIndexUtils.inferPartitionPath`, and 
`SparkFileFormatInternalRecordContext` are not Lance-specific — they affect 
Parquet and HFile equally. `ExpressionPayload` is used by `MERGE INTO` for all 
file formats, `UpdateProcessor` runs for all MOR reads, and 
`SparkFileFormatInternalRecordContext` is the shared trait used by the Parquet 
reader context too.
   
   Recommend splitting into:
   - **PR A (low risk):** Lance InputFormat plumbing + record type resolution + 
misc fixes
   - **PR B (higher risk):** ExpressionPayload merge fixes — needs dedicated 
Parquet `MERGE INTO` test coverage and makes regressions easier to bisect
   
   With just PR A, these would work end-to-end:
   - `CREATE TABLE`
   - `CTAS` (CREATE TABLE AS SELECT)
   - `INSERT INTO`
   - `SELECT` / read queries
   - `count(*)`
   
   And then for the remaining issues we can follow up in PR B:
   - **`MERGE INTO`** — uses `ExpressionPayload`, needs the Layer 3 fixes
   - **`UPDATE` / `DELETE` on COW** — the `inferPartitionPath` fix is needed 
when the merged record isn't identity-equal to incoming or existing. The PR's 
own comment explains: *"for payload-based records, `incoming.getData()` is the 
payload object, not the InternalRow"* — so the identity check falls through to 
the schema conversion path, which breaks without the fix
   - **MOR snapshot reads after updates** — `UpdateProcessor.handleNonDeletes` 
is needed to correctly process payload-based records during log file merging
   
   
   Note: @yihua @wombatu-kun Let me know if this split would work rather than 
doing this in one PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(lance): Ensure Spark-SQL works correctly when using lance as base file format [hudi]

Reply via email to