mdashti opened a new pull request, #23104: URL: https://github.com/apache/datafusion/pull/23104
## Which issue does this close?

Closes #23103.

## Rationale for this change

A hash join pushes a build-side dynamic filter (`key IN build_keys`) down to the probe scan. For a null-aware anti join (`NOT IN`), that filter drops the probe's NULL rows. But `NOT IN` three-valued logic needs a probe-side NULL to collapse the whole result to zero rows. With the NULL filtered away at the scan, before the join's null-aware check runs, the join returns rows that shouldn't be there.

## What changes are included in this PR?

`SharedBuildAccumulator::build_filter` now ORs `probe_key IS NULL` into the pushed predicate when the join is `null_aware`. Non-NULL probe rows still get filtered, so the optimization stays. `HashJoinExec`'s `null_aware` validation already guarantees a single probe key.

## Are these changes tested?

Yes. Added a parquet-backed case to `null_aware_anti_join.slt`. The existing cases use in-memory `VALUES`, whose scans never apply the pushed filter, so they passed despite the bug. The new one sets `parquet.pushdown_filters = true` so the filter runs row-level. Without the fix it returns `1, 3`; with it, zero rows.

## Are there any user-facing changes?

A `NOT IN` over a NULL-bearing inner now returns zero rows instead of leaking rows, when join dynamic filter pushdown and row-level scan filtering are both on.
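
## Illustrative sketch

To make the rationale concrete, here is a minimal Python model of the behavior described above. It is not DataFusion code: `not_in_anti_join`, `dynamic_filter`, and the sample keys are hypothetical names and data chosen for illustration, `None` stands in for SQL NULL, and the single-key case is assumed. The outer table plays the build side and the `NOT IN` subquery plays the probe side.

```python
from typing import List, Optional, Sequence

Key = Optional[int]  # None stands in for SQL NULL


def not_in_anti_join(outer: Sequence[Key], inner: Sequence[Key]) -> List[Key]:
    """Return the outer keys for which `key NOT IN (inner)` is TRUE in SQL."""
    if not inner:
        # NOT IN over an empty set is TRUE for every outer row.
        return list(outer)
    if any(k is None for k in inner):
        # A NULL in the inner set means the predicate is never TRUE.
        return []
    inner_set = set(inner)
    # A NULL outer key compares as NULL (not TRUE), so it is dropped.
    return [k for k in outer if k is not None and k not in inner_set]


def dynamic_filter(probe: Sequence[Key], build_keys: Sequence[Key],
                   keep_nulls: bool) -> List[Key]:
    """Model of the pushed-down scan filter `key IN build_keys`.

    With keep_nulls=True it models `key IN build_keys OR key IS NULL`.
    """
    build_set = {k for k in build_keys if k is not None}
    return [k for k in probe if k in build_set or (keep_nulls and k is None)]


outer = [1, 2, 3]     # build side (hypothetical data)
inner = [2, None]     # probe side: the NOT IN subquery, containing a NULL

# Reference answer with no scan-level filtering at all.
expected = not_in_anti_join(outer, inner)

# Filter without the IS NULL arm: the probe NULL is dropped before the join.
leaky = not_in_anti_join(outer, dynamic_filter(inner, outer, keep_nulls=False))

# Filter with the IS NULL arm: the probe NULL reaches the join.
fixed = not_in_anti_join(outer, dynamic_filter(inner, outer, keep_nulls=True))

print(expected, leaky, fixed)
```

In this model the unfiltered join and the filter with the `IS NULL` arm both yield no rows, while the filter without it lets `1` and `3` through, which mirrors the symptom reported above. Non-NULL probe keys that match no build key are still discarded in both variants, so the pruning benefit is retained.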
