kosiew opened a new issue, #22772:
URL: https://github.com/apache/datafusion/issues/22772

   ## Summary
   
   Recent changes in #21666 extracted key composition helpers in hash-join 
dynamic filtering (membership, bounds, and membership+bounds combination). 
However, `SharedBuildAccumulator::build_filter` still owns substantial 
expression-shape branching, fallback policy, and mode-specific assembly 
behavior.
   
   This follow-up focuses on moving finalize-time expression policy into a 
dedicated composer/helper so accumulator finalization stays lifecycle-focused. 
The goal is not to force identical expression shapes for all modes, but to make 
the mode-specific rules explicit and centralized.
   
   ## Current State
   What is already in place:
   1. Helper-level extraction exists for:
      - membership predicate construction
      - bounds predicate construction
      - membership+bounds combination
   2. Dynamic-filter update coordination remains in `SharedBuildAccumulator`.
   
   What still needs refinement:
   1. `CollectLeft` and `Partitioned` finalize paths still repeat the 
membership/bounds composition flow.
   2. `build_filter` still mixes finalization/lifecycle concerns with 
expression-shape and fallback policy.
   3. Test coverage in this area is currently stronger for lifecycle/reporting 
behavior than for expression-policy parity and edge cases.
   
   ## Problem Statement
   The remaining coupling makes expression-shape changes riskier than necessary:
   - mode-specific branching and fallback expression decisions are embedded in 
accumulator finalization
   - composition behavior is partially duplicated between finalize paths
   - there is no focused guard ensuring equivalent policy where inputs are 
equivalent, while preserving legitimate mode-specific differences
   
   ## Goals
   - Extract finalize-time expression policy into a dedicated composer/builder 
layer.
   - Keep `SharedBuildAccumulator` focused on synchronization, completion, and 
update orchestration.
   - Remove repeated membership/bounds composition flow between `CollectLeft` 
and `Partitioned` paths.
   - Keep partitioned routing, empty-partition handling, and canceled/unknown 
fallbacks explicit.
   - Add explicit tests for policy parity and key edge cases.
   
   ## Non-Goals
   - Changing dynamic-filter query semantics.
   - Reworking hash-join state-machine behavior beyond this separation.
   - Planner-level changes unrelated to dynamic-filter expression composition.
   
   ## Proposed Follow-up Design
   Introduce a focused composition API for finalize-time expression assembly 
(name can vary, for example `DynamicFilterExprComposer` or 
`BuildSideFilterComposer`).
   
   Responsibilities:
   1. Accept normalized inputs per partition/finalize mode (membership, bounds, 
partition status metadata).
   2. Produce final expression outcomes from one explicit policy surface for 
both `CollectLeft` and `Partitioned`.
   3. Encapsulate fallback behavior (for example no-op, true/false branch 
defaults, case branch handling) in one place.
   4. Preserve legitimate mode-specific output shapes, especially partitioned 
`CASE` routing.
   5. Keep return values explicit for "no expression", "direct expression", and 
"composed/case expression" outcomes.
   
   `SharedBuildAccumulator` should:
   - collect and normalize finalize inputs
   - call the composer
   - apply `dynamic_filter.update(...)` when composition yields an expression
   - avoid embedding expression-shape policy directly in finalization control 
flow
   
   ## Implementation Plan
   1. Extract remaining mode-specific composition branches from `build_filter` 
into a helper/composer.
   2. Refactor both `CollectLeft` and `Partitioned` finalize paths to delegate 
composition.
   3. Keep state/lifecycle transitions unchanged.
   4. Add tests that assert expression-shape parity and edge-case behavior.
   5. Run targeted hash-join and sqllogictest coverage to confirm no semantic 
drift.
   
   ## Testing Strategy
   Add focused tests for composition behavior:
   - Policy parity test: equivalent logical inputs across `CollectLeft` and 
`Partitioned` follow equivalent membership/bounds/fallback rules, allowing 
expected mode-specific wrappers such as partitioned `CASE` routing.
   - Edge cases:
     - missing membership predicate
     - missing/partial bounds
     - empty partition handling
     - canceled/unknown partition handling where relevant
     - no-branch fallbacks
   - Keep existing lifecycle/reporting tests intact and passing.
   - Ensure existing hash-join dynamic-filter and SQL behavior tests continue 
to pass unchanged.
   
   ## Acceptance Criteria
   - `build_filter` no longer contains substantial expression-shape/fallback 
policy beyond delegating to the composer/helper and applying updates.
   - Finalize-time expression composition policy is centralized behind a 
dedicated helper/composer API.
   - New tests cover policy parity and edge-case composition behavior.
   - Existing dynamic-filter behavior/tests pass unchanged.
   - Readability improves by clearly separating synchronization logic from 
expression construction logic.
   
   ## References
   - Related area:
     - `datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs`
     - `datafusion/physical-plan/src/joins/hash_join/stream.rs`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to