mdashti opened a new pull request, #23104:
URL: https://github.com/apache/datafusion/pull/23104

   ## Which issue does this close?
   
   Closes #23103.
   
   ## Rationale for this change
   
   A hash join pushes a build-side dynamic filter (`key IN build_keys`) down to 
the probe scan. For a null-aware anti join (`NOT IN`), that filter drops the 
probe's NULL rows, because `NULL IN (...)` evaluates to NULL rather than true. 
But under `NOT IN` three-valued logic, a single probe-side NULL must collapse 
the whole result to zero rows. With the NULL filtered away at the scan, before 
the join's null-aware check runs, the join returns rows that shouldn't be there.
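
   To make the failure mode concrete, here is a minimal standalone sketch in plain Rust. It does not use DataFusion; `not_in`, `not_in_query`, `pushed_filter`, and `scan_with_filter` are hypothetical helpers, and the sample values are made up for illustration. `Option<i32>` stands in for a nullable key and `Option<bool>` for a three-valued SQL result.

```rust
/// Three-valued result of `x NOT IN (set)`; `None` models SQL NULL.
fn not_in(x: Option<i32>, set: &[Option<i32>]) -> Option<bool> {
    if set.is_empty() {
        return Some(true); // NOT IN over an empty set is TRUE, even for a NULL x
    }
    let x = x?; // NULL NOT IN (non-empty set) is NULL
    if set.iter().any(|v| *v == Some(x)) {
        return Some(false); // a definite match
    }
    if set.iter().any(|v| v.is_none()) {
        return None; // no match, but a NULL in the set makes the answer unknown
    }
    Some(true)
}

/// Rows of `outer` kept by `WHERE x NOT IN (inner)`: only rows evaluating to TRUE.
fn not_in_query(outer: &[Option<i32>], inner: &[Option<i32>]) -> Vec<Option<i32>> {
    outer
        .iter()
        .copied()
        .filter(|x| not_in(*x, inner) == Some(true))
        .collect()
}

/// Models the pushed filter `key IN build_keys`: a NULL key never evaluates to TRUE.
fn pushed_filter(key: Option<i32>, build_keys: &[i32]) -> bool {
    matches!(key, Some(k) if build_keys.contains(&k))
}

/// Models a scan that keeps only the rows passing the pushed filter.
fn scan_with_filter(rows: &[Option<i32>], build_keys: &[i32]) -> Vec<Option<i32>> {
    rows.iter()
        .copied()
        .filter(|k| pushed_filter(*k, build_keys))
        .collect()
}
```

   With `outer = [1, 2, 3]` and `inner = [2, NULL]`, `not_in_query` returns no rows. If `inner` first passes through `scan_with_filter` with build keys `[1, 2, 3]`, the NULL is dropped and the same query returns `1` and `3`.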
   
   ## What changes are included in this PR?
   
   `SharedBuildAccumulator::build_filter` now ORs `probe_key IS NULL` into the 
pushed predicate when the join is `null_aware`. Non-NULL probe rows that match 
no build key are still filtered out, so the optimization is preserved. 
`HashJoinExec`'s `null_aware` validation already guarantees a single probe key, 
so a single `IS NULL` term suffices.
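
   The effect of the widened predicate can be sketched in plain Rust as well. This is an illustration only, not the code in `build_filter`; `null_aware_filter` and `scan_null_aware` are hypothetical names, and the keys are made-up values.

```rust
/// Models `key IN build_keys OR key IS NULL` for a single nullable probe key.
fn null_aware_filter(key: Option<i32>, build_keys: &[i32]) -> bool {
    match key {
        None => true,                       // `key IS NULL` keeps NULL probe rows
        Some(k) => build_keys.contains(&k), // non-NULL rows are still pruned by membership
    }
}

/// Models a scan that keeps only the rows passing the null-aware filter.
fn scan_null_aware(rows: &[Option<i32>], build_keys: &[i32]) -> Vec<Option<i32>> {
    rows.iter()
        .copied()
        .filter(|k| null_aware_filter(*k, build_keys))
        .collect()
}
```

   For probe rows `[2, 7, NULL]` and build keys `[1, 2, 3]`, the scan keeps `2` and the NULL and still prunes `7`, so the join's null-aware check sees the NULL while non-matching rows are skipped.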
   
   ## Are these changes tested?
   
   Yes. Added a parquet-backed case to `null_aware_anti_join.slt`. The existing 
cases use in-memory `VALUES`, whose scans never apply the pushed filter, so 
they passed despite the bug. The new one sets `parquet.pushdown_filters = true` 
so the filter is applied row by row during the scan. Without the fix the query 
returns `1, 3`; with it, zero rows.
   
   ## Are there any user-facing changes?
   
   A `NOT IN` whose subquery result contains a NULL now returns zero rows 
instead of leaking rows, when join dynamic filter pushdown and row-level scan 
filtering are both on.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

