wombatu-kun commented on PR #18375: URL: https://github.com/apache/hudi/pull/18375#issuecomment-4234709359
Thanks for the careful read, @rahil-c — but a categorical **no** on splitting this PR, and I want to explain why. **The work is already complete and mutually necessary.** The ExpressionPayload merge fixes (UpdateProcessor, inferPartitionPath, SparkFileFormatInternalRecordContext) were not tacked on — they were discovered *while* enabling Lance as a base file format. Lance forces SPARK record type end-to-end, which made latent bugs in the Avro-payload path surface immediately for MERGE INTO / UPDATE / DELETE / MOR reads. Landing just "PR A" means Lance passes only the happy-path SQL (CREATE TABLE / CTAS / INSERT / SELECT / count(\*)) and silently regresses on everything else — exactly the operations users will reach for next. Shipping that partial state is worse than shipping nothing. **Splitting now is pure overhead, not safety.** The test matrix that actually exercises these production-code paths is the parametrized one in this PR (`TestSparkSqlCoreFlow`, `TestSqlStatement`, `TestQueryMergeOnReadOptimizedTable`) — which runs each scenario against both parquet and lance. To split, I would have to: 1. Re-derive which production hunk is required for which parquet-only vs lance-only test subset, something the current combined state makes trivially visible in one diff but would become archaeology across two PRs; 2. Invent artificial parquet-only regression tests for ExpressionPayload fixes that the current suite already covers (parquet runs side-by-side with lance in every parametrized case); 3. Rebase / re-test / re-review twice, doubling CI load and reviewer round-trips on code that is already green. None of that work makes the change safer — it just delays it. **Bisectability is already fine.** Each production-code fix is a self-contained hunk with a focused commit (`fixed by review notes`, `fixed by review notes 2`, etc. on this branch), and the PR description + review threads make it explicit *why* each fix exists and which symptom it addresses. If a regression is reported on parquet MERGE INTO, the blame points at a single hunk of `UpdateProcessor.handleNonDeletes` regardless of whether it arrived in one PR or two. **Parquet MERGE INTO coverage already exists** — the parametrized `TestSqlStatement` and `TestSparkSqlCoreFlow` runs cover MERGE INTO / UPDATE / DELETE with parquet in every config row. The combined suite is the regression gate for the ExpressionPayload fixes. Happy to keep iterating on review comments (all six threads from yesterday are now addressed in the latest push), but I'd like to land this as one unit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
