Re: [PR] feat(lance): Ensure Spark-SQL works correctly when using lance as base file format [hudi]

via GitHub Mon, 13 Apr 2026 00:52:43 -0700


wombatu-kun commented on PR #18375:
URL: https://github.com/apache/hudi/pull/18375#issuecomment-4234709359


   Thanks for the careful read, @rahil-c — but a categorical **no** on 
splitting this PR, and I want to explain why.
   
   **The work is already complete and mutually necessary.** The 
ExpressionPayload merge fixes (UpdateProcessor, inferPartitionPath, 
SparkFileFormatInternalRecordContext) were not tacked on — they were discovered 
*while* enabling Lance as a base file format. Lance forces SPARK record type 
end-to-end, which made latent bugs in the Avro-payload path surface immediately 
for MERGE INTO / UPDATE / DELETE / MOR reads. Landing just "PR A" means Lance 
passes only the happy-path SQL (CREATE TABLE / CTAS / INSERT / SELECT / 
count(\*)) and silently regresses on everything else — exactly the operations 
users will reach for next. Shipping that partial state is worse than shipping 
nothing.
   
   **Splitting now is pure overhead, not safety.** The test matrix that 
actually exercises these production-code paths is the parametrized one in this 
PR (`TestSparkSqlCoreFlow`, `TestSqlStatement`, 
`TestQueryMergeOnReadOptimizedTable`) — which runs each scenario against both 
parquet and lance. To split, I would have to:
   
   1. Re-derive which production hunk is required for which parquet-only vs 
lance-only test subset, something the current combined state makes trivially 
visible in one diff but would become archaeology across two PRs;
   2. Invent artificial parquet-only regression tests for ExpressionPayload 
fixes that the current suite already covers (parquet runs side-by-side with 
lance in every parametrized case);
   3. Rebase / re-test / re-review twice, doubling CI load and reviewer 
round-trips on code that is already green.
   
   None of that work makes the change safer — it just delays it.
   
   **Bisectability is already fine.** Each production-code fix is a 
self-contained hunk with a focused commit (`fixed by review notes`, `fixed by 
review notes 2`, etc. on this branch), and the PR description + review threads 
make it explicit *why* each fix exists and which symptom it addresses. If a 
regression is reported on parquet MERGE INTO, the blame points at a single hunk 
of `UpdateProcessor.handleNonDeletes` regardless of whether it arrived in one 
PR or two.
   
   **Parquet MERGE INTO coverage already exists** — the parametrized 
`TestSqlStatement` and `TestSparkSqlCoreFlow` runs cover MERGE INTO / UPDATE / 
DELETE with parquet in every config row. The combined suite is the regression 
gate for the ExpressionPayload fixes.
   
   Happy to keep iterating on review comments (all six threads from yesterday 
are now addressed in the latest push), but I'd like to land this as one unit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(lance): Ensure Spark-SQL works correctly when using lance as base file format [hudi]

Reply via email to