mbutrovich opened a new pull request, #21660: URL: https://github.com/apache/datafusion/pull/21660
## Which issue does this PR close? - Closes #. ## Rationale for this change While enabling `SortMergeJoinExec` with filters in [DataFusion Comet](https://github.com/apache/datafusion-comet), I hit a correctness bug in DataFusion's `SortMergeJoinExec` for full outer joins with nullable filter columns. The bug was originally surfaced via the [SPARK-43113](https://issues.apache.org/jira/browse/SPARK-43113) reproducer. When a join filter expression evaluates to `NULL` (e.g., `l.b < (r.b + 1)` where `r.b` is `NULL`), the full outer join treats the row pair as **matched** instead of **unmatched**. Per SQL semantics, `NULL` in a boolean filter context is `false` (not satisfied), so the rows should be emitted as separate unmatched rows. The bug has been present since filtered full outer join support was added to SMJ in #12764 / #13369. It was never caught because: 1. The join fuzz tests generate filter column data with `Int32Array::from_iter_values()`, which never produces `NULL` values. 2. No existing unit test or sqllogictest exercised a full outer join filter that evaluates to `NULL`. ## What changes are included in this PR? **Root cause:** The full outer join code path had a special case that preserved raw `NULL` values from the filter expression result (`pre_mask`) instead of converting them to `false` via `prep_null_mask_filter` like left/right outer joins do. This caused two problems: 1. **Streamed (left) side:** In `get_corrected_filter_mask()`, `NULL` entries in `filter_mask` are treated as "pass through" (for pre-null-joined rows from `append_nulls()`). But when the filter expression itself returns `NULL`, those entries also appear as `NULL` in the mask — and get incorrectly treated as matched. This produced wrong join output (matched rows instead of unmatched). 2. **Buffered (right) side:** `BooleanArray::value()` was called on `NULL` positions in `pre_mask` to update `FilterState`. At NULL positions, the values buffer contains a deterministic but semantically meaningless result (computed from the default zero-storage of NULL inputs). For some rows this value happens to be `true`, which incorrectly marks unmatched buffered rows as `SomePassed` and silently drops them from the output. **Fix:** Remove the full outer join special case in `materializing_stream.rs`. All outer join types now uniformly use the null-corrected `mask` (where `NULL` → `false` via `prep_null_mask_filter`) for both deferred filtering metadata and `FilterState` tracking. Semi/anti/mark joins are unaffected — they use `BitwiseSortMergeJoinStream` which already converts NULLs to `false`. **Tests:** - New unit test `join_full_null_filter_result` reproducing the SPARK-43113 scenario with a nullable right-side column. - Modified `make_staggered_batches_i32` in `join_fuzz.rs` to inject ~10% `NULL` values into the filter column (`x`), so the fuzz tests exercise `NULL` filter handling across all join types. ## Are these changes tested? Yes. - New unit test (`join_full_null_filter_result`) directly reproduces the bug. - Existing 57 SMJ unit tests all pass. - All 41 join fuzz tests pass with the new nullable filter column data, including `test_full_join_1k_filtered` which compares `HashJoinExec` vs `SortMergeJoinExec` and would have caught this bug if the fuzz data had included `NULL`s. - Will run 100 iterations of the fuzz tests overnight to shake out any remaining nondeterministic issues. ## Are there any user-facing changes? Full outer sort-merge joins with filters involving nullable columns now produce correct results. Previously, rows where the filter evaluated to `NULL` were incorrectly included as matched; they are now correctly emitted as unmatched (null-joined) rows. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
