1fanwang opened a new pull request, #21870:
URL: https://github.com/apache/datafusion/pull/21870

   ## Which issue does this PR close?
   
   - Closes #14943.
   
   ## Rationale for this change
   
   `SimplifyExpressions` disables canonicalization for `LogicalPlan::Join` (see 
#8780), so the AND/OR dedup in `expr_contains_inner` -- which walks each 
conjunct/disjunct chain and compares each leaf with structural `==` -- cannot 
recognize duplicate predicates that differ only by commutative operand order 
(e.g. `A = B` vs `B = A`).
   
   Workloads that simplify expressions inside a join filter (delta-rs MERGE is 
the case reported in #14943) end up retaining the duplicate even after multiple 
simplifier cycles, because no canonicalization step normalizes the operand 
order first.
   
   `@alamb` suggested the right shape in the issue thread:
   
   > We simplify A = B and B = A in `common_sub_expression_eliminate` but not 
`simplify_expressions`. I wonder if this is another example of code that would 
be beneficial to pull into simplify_expressions 🤔
   
   This PR does exactly that: routes the leaf comparison through the 
`NormalizeEq` trait that CSE already uses.
   
   ## What changes are included in this PR?
   
   - `expr_contains_inner` (in `simplify_expressions/utils.rs`) compares the 
leaf via `Expr::normalize_eq` instead of `==`. `NormalizeEq` handles `+`, `*`, 
`&`, `|`, `^`, `=`, `!=` commutatively and falls through to structural `==` for 
everything else, so:
     - Existing rules for non-commutative operators are unchanged.
     - The existing `!needle.is_volatile()` guard in `expr_contains` is 
unchanged.
   - New regression test 
`test_simplify_swapped_operands_in_and_or_no_canonicalize` covers:
     - `c = lit AND lit = c` --> `c = lit`
     - `lit = c AND c = lit` --> `lit = c`
     - `c = lit OR lit = c` --> `c = lit`
     - 3-conjunct case nested under `(a AND b) AND c` shape
   
   ## Are these changes tested?
   
   Yes. The new regression test fails on `main` (input passes through 
unchanged) and passes after the fix. Full optimizer suite (`cargo test -p 
datafusion-optimizer --lib` -- 682 tests) and workspace lint (`cargo clippy 
--workspace --lib --tests`) pass cleanly.
   
   ## Are there any user-facing changes?
   
   No public API changes. Behaviorally, AND/OR chains containing 
commutative-equivalent duplicates now collapse correctly even when the 
simplifier's canonicalizer is disabled (currently only the `LogicalPlan::Join` 
case). Existing canonicalize-on paths produce the same output as before.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to