kosiew opened a new pull request, #21912:
URL: https://github.com/apache/datafusion/pull/21912
## Which issue does this PR close?
* Part of #21910.
---
## Rationale for this change
Execution operators in DataFusion may emit `RecordBatch` instances whose
embedded schema differs from the operator’s declared output schema (e.g.,
mismatched field names in recursive CTEs).
While PR #21770 addressed this for `RecursiveQueryExec`, the fix was
implemented inline and is not reusable. This risks inconsistent
handling across operators and makes auditing more difficult.
This PR introduces a reusable helper to explicitly normalize batch schemas
at the execution layer, ensuring consistency with declared operator contracts
and reducing ad hoc implementations.
---
## What changes are included in this PR?
* Introduced a new helper:
  * `normalize_batch_schema(batch, expected_schema)` in
    `datafusion/physical-plan/src/common.rs`
  * Handles:
    * Fast path: identical or structurally equal schemas (no-op)
    * Zero-copy schema rebinding when only field names differ
    * Error cases for incompatible column counts or data types
* Migrated `RecursiveQueryExec`:
  * Replaced inline schema rebinding logic in
    `RecursiveQueryStream::push_batch` with the new helper
* Added documentation:
  * Explains behavior, guarantees (including zero-copy), and why
    `RecordBatch::with_schema` is insufficient
* Re-exported helper:
  * `pub use crate::common::normalize_batch_schema;` in `lib.rs`
* Minor cleanup:
  * Standardized imports and formatting
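
The three paths the helper handles can be illustrated with a minimal, std-only sketch. This is not the code in this PR: `DataType`, `Field`, `SchemaRef`, `ColumnRef`, and `Batch` below are hypothetical toy stand-ins for the Arrow types, and the `String` error type is an assumption made to keep the example self-contained. Only the control flow (no-op fast path, zero-copy rebind, errors on count or type mismatch) is taken from the description above.

```rust
use std::sync::Arc;

// Toy stand-ins for Arrow's types (assumptions for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

type SchemaRef = Arc<Vec<Field>>;
// Column payload is opaque here; `DataType` is only a label in this toy model.
type ColumnRef = Arc<Vec<i64>>;

#[derive(Debug, Clone)]
struct Batch {
    schema: SchemaRef,
    columns: Vec<ColumnRef>,
}

/// Rebind `batch` to `expected` without touching column data.
fn normalize_batch_schema(batch: Batch, expected: &SchemaRef) -> Result<Batch, String> {
    // Fast path: same allocation, or structurally equal schemas. Return as-is.
    if Arc::ptr_eq(&batch.schema, expected) || batch.schema == *expected {
        return Ok(batch);
    }

    // Error case: the number of columns must match.
    if batch.schema.len() != expected.len() {
        return Err(format!(
            "column count mismatch: batch has {}, expected {}",
            batch.schema.len(),
            expected.len()
        ));
    }

    // Error case: data types must match position by position.
    for (actual, want) in batch.schema.iter().zip(expected.iter()) {
        if actual.data_type != want.data_type {
            return Err(format!(
                "type mismatch for column '{}': batch has {:?}, expected {:?}",
                want.name, actual.data_type, want.data_type
            ));
        }
    }

    // Only names differ: keep the same column `Arc`s and swap the schema.
    Ok(Batch {
        schema: Arc::clone(expected),
        columns: batch.columns,
    })
}
```

In this sketch, "zero-copy" means the returned batch holds the same `Arc` column pointers as the input, so a caller can check it with `Arc::ptr_eq` on the columns before and after the call.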
---
## Are these changes tested?
Yes. The following unit tests were added in `common.rs`:
* `test_normalize_batch_schema_noop_identical_schema`
* `test_normalize_batch_schema_renames_fields`
* `test_normalize_batch_schema_noop_arc_clone`
* `test_normalize_batch_schema_strips_metadata`
* `test_normalize_batch_schema_error_column_count_mismatch`
* `test_normalize_batch_schema_error_type_mismatch`
* `test_normalize_batch_schema_multi_column_rename`
These tests verify:
* No-op behavior for matching schemas
* Zero-copy renaming when only field names differ
* Correct error handling for incompatible schemas
* Metadata handling behavior
Existing regression tests for recursive CTEs continue to pass via the
migrated call site.
---
## Are there any user-facing changes?
No direct user-facing changes.
This PR improves internal correctness and consistency of execution-layer
schema handling. It may indirectly affect downstream consumers (e.g., CSV/JSON
writers, TopK) by ensuring emitted batches always conform to the declared
schema.
---
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.