kosiew opened a new pull request, #21912:
URL: https://github.com/apache/datafusion/pull/21912

   ## Which issue does this PR close?
   
   * Part of #21910.
   
   ---
   
   ## Rationale for this change
   
   Execution operators in DataFusion may emit `RecordBatch` instances whose 
embedded schema differs from the operator’s declared output schema (e.g., 
mismatched field names in recursive CTEs).
   
   While PR #21770 addressed this for `RecursiveQueryExec`, the fix was 
implemented inline and is not reusable. This risks inconsistent handling 
across operators and makes auditing more difficult.
   
   This PR introduces a reusable helper to explicitly normalize batch schemas 
at the execution layer, ensuring consistency with declared operator contracts 
and reducing ad hoc implementations.
   
   ---
   
   ## What changes are included in this PR?
   
   * Introduced a new helper:
   
     * `normalize_batch_schema(batch, expected_schema)` in 
`datafusion/physical-plan/src/common.rs`
     * Handles:
   
       * Fast path: identical or structurally equal schemas (no-op)
       * Zero-copy schema rebinding when only field names differ
       * Error cases for incompatible column counts or data types
   
   * Migrated `RecursiveQueryExec`:
   
     * Replaced inline schema rebinding logic in 
`RecursiveQueryStream::push_batch` with the new helper
   
   * Added documentation:
   
     * Explains behavior, guarantees (including zero-copy), and why 
`RecordBatch::with_schema` is insufficient
   
   * Re-exported helper:
   
     * `pub use crate::common::normalize_batch_schema;` in `lib.rs`
   
   * Minor cleanup:
   
     * Standardized imports and formatting
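
The three behaviors listed for the helper (fast path, zero-copy rebinding, error cases) can be illustrated with a minimal, std-only sketch. This is not the code from this PR and does not use the Arrow API: `DataType`, `Field`, `Schema`, `Column`, and `Batch` below are hypothetical stand-ins for the Arrow types, errors are plain `String`s, and schema metadata is not modeled. Only the function name and argument order are taken from the description above.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for Arrow's DataType / Field / Schema / RecordBatch.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

#[derive(Debug, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

// Column data is reference-counted, so moving it into a new batch copies no values.
type Column = Arc<Vec<i64>>;

#[derive(Debug, Clone)]
struct Batch {
    schema: Arc<Schema>,
    columns: Vec<Column>,
}

/// Sketch of the normalization logic: rebind `batch` to `expected` when the
/// two schemas differ only in field names, otherwise return an error.
fn normalize_batch_schema(batch: Batch, expected: &Arc<Schema>) -> Result<Batch, String> {
    // Fast path: same Arc, or structurally equal schemas. Nothing to do.
    if Arc::ptr_eq(&batch.schema, expected) || *batch.schema == **expected {
        return Ok(batch);
    }

    let actual = &batch.schema.fields;
    let wanted = &expected.fields;

    // Error case: incompatible column counts.
    if actual.len() != wanted.len() {
        return Err(format!(
            "column count mismatch: batch has {}, expected {}",
            actual.len(),
            wanted.len()
        ));
    }

    // Error case: incompatible data types (field names are allowed to differ).
    for (i, (a, w)) in actual.iter().zip(wanted.iter()).enumerate() {
        if a.data_type != w.data_type {
            return Err(format!(
                "type mismatch at column {}: {:?} vs {:?}",
                i, a.data_type, w.data_type
            ));
        }
    }

    // Zero-copy rebinding: reuse the same column Arcs under the expected schema.
    Ok(Batch {
        schema: Arc::clone(expected),
        columns: batch.columns,
    })
}
```

In this model a successful rename returns a batch whose `schema` is the same `Arc` as `expected` and whose columns are the original `Arc`s, which is what "zero-copy" means here.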
   
   ---
   
   ## Are these changes tested?
   
   Yes. The following unit tests were added in `common.rs`:
   
   * `test_normalize_batch_schema_noop_identical_schema`
   * `test_normalize_batch_schema_renames_fields`
   * `test_normalize_batch_schema_noop_arc_clone`
   * `test_normalize_batch_schema_strips_metadata`
   * `test_normalize_batch_schema_error_column_count_mismatch`
   * `test_normalize_batch_schema_error_type_mismatch`
   * `test_normalize_batch_schema_multi_column_rename`
   
   These tests verify:
   
   * No-op behavior for matching schemas
   * Zero-copy renaming when only field names differ
   * Correct error handling for incompatible schemas
   * Metadata handling (see `test_normalize_batch_schema_strips_metadata`)
   
   Existing regression tests for recursive CTEs continue to pass via the 
migrated call site.
   
   ---
   
   ## Are there any user-facing changes?
   
   No direct user-facing changes.
   
   This PR improves internal correctness and consistency of execution-layer 
schema handling. It may indirectly affect downstream consumers (e.g., CSV/JSON 
writers, TopK) by ensuring emitted batches always conform to the declared 
schema.
   
   ---
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

