rahil-c commented on PR #18375: URL: https://github.com/apache/hudi/pull/18375#issuecomment-4232006077
**[High] Recommend splitting ExpressionPayload merge fixes into a separate PR** The changes to `UpdateProcessor.handleNonDeletes`, `HoodieIndexUtils.inferPartitionPath`, and `SparkFileFormatInternalRecordContext` are not Lance-specific — they affect Parquet and HFile equally. `ExpressionPayload` is used by `MERGE INTO` for all file formats, `UpdateProcessor` runs for all MOR reads, and `SparkFileFormatInternalRecordContext` is the shared trait used by the Parquet reader context too. Recommend splitting into: - **PR A (low risk):** Lance InputFormat plumbing + record type resolution + misc fixes - **PR B (higher risk):** ExpressionPayload merge fixes — needs dedicated Parquet `MERGE INTO` test coverage and makes regressions easier to bisect With just PR A, these would work end-to-end: - `CREATE TABLE` - `CTAS` (CREATE TABLE AS SELECT) - `INSERT INTO` - `SELECT` / read queries - `count(*)` And then for the remaining issues we can follow up in PR B: - **`MERGE INTO`** — uses `ExpressionPayload`, needs the Layer 3 fixes - **`UPDATE` / `DELETE` on COW** — the `inferPartitionPath` fix is needed when the merged record isn't identity-equal to incoming or existing. The PR's own comment explains: *"for payload-based records, `incoming.getData()` is the payload object, not the InternalRow"* — so the identity check falls through to the schema conversion path, which breaks without the fix - **MOR snapshot reads after updates** — `UpdateProcessor.handleNonDeletes` is needed to correctly process payload-based records during log file merging Note: @yihua @wombatu-kun Let me know if this split would work rather than doing this in one PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
