kosiew opened a new issue, #22769:
URL: https://github.com/apache/datafusion/issues/22769

   
   ## Summary
   
   `LogicalPlan::Unnest` currently has an inconsistent contract across the 
logical plan expression APIs:
   
   - `LogicalPlan::expressions()` / `apply_expressions()` expose the 
`Unnest::exec_columns` as `Expr::Column` values.
   - `LogicalPlan::with_new_exprs()` rejects expressions for 
`LogicalPlan::Unnest` via `assert_no_expressions(expr)?`.
   
   This means generic optimizer code cannot safely assume that 
`node.with_new_exprs(node.expressions(), new_inputs)` is valid for all logical 
plan nodes. `PushDownLeafProjections` now has to special-case `Unnest` as a 
barrier to avoid an internal error.
   
   ## Motivation
   
   The immediate regression fixed in #22620 was an optimizer failure:
   
   ```text
   Internal error: Assertion failed: expr.is_empty(): Unnest ... should have no 
exprs, got [Column(...)]
   ```
   
   The failure happened because `PushDownLeafProjections` rebuilt a node using:
   
   ```rust
   node.with_new_exprs(node.expressions(), new_inputs)?
   ```
   
   For `Unnest`, `expressions()` returns `exec_columns`, but `with_new_exprs()` 
requires the expression list to be empty. This is surprising for generic 
logical plan transforms and may cause similar bugs in other optimizer rules.
   
   ## Current behavior
   
   Relevant locations:
   
   - `datafusion/expr/src/logical_plan/tree_node.rs`
     - `LogicalPlan::apply_expressions()` maps `Unnest::exec_columns` into 
`Expr::Column` values.
   - `datafusion/expr/src/logical_plan/plan.rs`
     - `LogicalPlan::with_new_exprs()` handles `LogicalPlan::Unnest` by calling 
`assert_no_expressions(expr)?`, then rebuilding from the old `exec_columns` and 
new input.
   - `datafusion/optimizer/src/extract_leaf_expressions.rs`
     - `try_push_into_inputs()` now explicitly returns `Ok(None)` for 
`LogicalPlan::Unnest` to avoid this mismatch.
   
   ## Why this matters
   
   Generic plan transforms often treat `expressions()` and `with_new_exprs()` 
as paired APIs: read expressions, transform or preserve them, then rebuild the 
node with new inputs and expressions. `Unnest` breaks that assumption.
   
   This increases risk that future optimizer changes will:
   
   1. hit internal errors when rebuilding `Unnest`, or
   2. incorrectly treat `Unnest::exec_columns` as normal evaluable expressions, 
even though they are structural operator parameters with special input/output 
column identity semantics.
   
   `Unnest` also has a column-identity hazard: an unnested output column may 
share a name with the input list/struct column but have different type and 
values. Name-based routing is insufficient to decide whether an expression can 
be pushed below `Unnest`.
   
   ## Proposed direction
   
   Make the `Unnest` contract explicit at the logical-plan API layer. Possible 
approaches:
   
   1. **Do not expose `Unnest::exec_columns` through generic expression APIs**
      - Treat them as structural parameters, not normal expressions.
      - `expressions()` would return empty for `Unnest`, matching 
`with_new_exprs()`.
   
   2. **Allow `with_new_exprs()` to accept and validate `Unnest::exec_columns`**
      - Keep exposing them through `expressions()`.
      - Rebuild `Unnest` from the provided columns after validating they are 
plain columns and semantically valid for the new input.
   
   3. **Split structural parameters from evaluable expressions**
      - Add or document a separate API for operator parameters such as 
`Unnest::exec_columns`.
      - Keep optimizer expression rewrites focused on expressions that can be 
transformed and rebound safely.
   
   The best option should preserve current optimizer semantics while making 
invalid generic rewrites hard to write.
   
   ## Acceptance criteria
   
   - The contract between `expressions()` / `apply_expressions()` and 
`with_new_exprs()` is documented and consistent for `LogicalPlan::Unnest`.
   - Generic rebuild patterns either work for `Unnest` or are clearly prevented 
by API shape/documentation.
   - Optimizer rules no longer need to rely on undocumented knowledge that 
`Unnest` exposes expressions but rejects them during rebuild.
   - Existing Unnest behavior and schemas remain unchanged.
   - Regression coverage includes a generic transform/rebuild path involving 
`Unnest`.
   
   ## Suggested tests
   
   - Unit test showing the chosen contract for `LogicalPlan::Unnest`:
     - if `expressions()` returns empty, assert that explicitly;
     - if it returns `exec_columns`, assert `with_new_exprs(expressions(), 
new_inputs)` succeeds for valid input.
   - Optimizer regression test with `PushDownLeafProjections` and `Unnest` 
remains passing.
   - SQLLogicTest coverage for expressions above `Unnest`, including:
     - expression references a sibling input column preserved through `Unnest`;
     - expression references the unnested output column itself.
   
   ## Related PR
   #22620 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to