RatulDawar commented on code in PR #23106:
URL: https://github.com/apache/datafusion/pull/23106#discussion_r3462118015


##########
datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs:
##########
@@ -685,12 +699,49 @@ impl SharedBuildAccumulator {
                     )?) as Arc<dyn PhysicalExpr>
                 };
 
-                self.dynamic_filter.update(filter_expr)?;
+                self.dynamic_filter
+                    .update(self.preserve_probe_nulls(filter_expr))?;
             }
         }
 
         Ok(())
     }
+
+    /// Keeps probe rows with a NULL key when the join semantics need them.
+    ///
+    /// The build-side predicate drops probe rows whose key is NULL. A null-aware anti join
+    /// (`NOT IN`) needs that NULL to reach the join so three-valued logic can collapse the
+    /// result, and a null-equal join needs it to match a build-side NULL. OR-ing `key IS NULL`
+    /// keeps those rows while preserving the filter's selectivity for the rest; the join refines
+    /// whatever the widened filter lets through.
+    fn preserve_probe_nulls(
+        &self,
+        filter_expr: Arc<dyn PhysicalExpr>,
+    ) -> Arc<dyn PhysicalExpr> {
+        if self.null_equality != NullEquality::NullEqualsNull && !self.null_aware {
+            return filter_expr;
+        }
+        // Only a key that can actually be NULL needs the disjunct; a NOT NULL key never widens.
+        // Null-aware joins are single-key; null-equal joins can be multi-key, so OR every nullable
+        // key. If every key is NOT NULL the filter is left untouched, at full selectivity.
+        let any_key_is_null = self
+            .on_right
+            .iter()
+            .filter(|key| key.nullable(&self.probe_schema).unwrap_or(true))

Review Comment:
   Should we widen when we are unable to check a key's nullability, i.e. the
   `unwrap_or(true)` fallback?
   From what I can see, the check can only fail when `on_right` and the probe
   schema are out of sync, which seems to be an invalid state?
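
To make the question concrete, here is a minimal standalone sketch of the two policies. It does not use the DataFusion API: `lookup_nullable`, `nullable_keys_lenient`, and `nullable_keys_strict` are hypothetical names, and a slice of `(name, nullable)` pairs stands in for the probe schema. The lenient version mirrors the `unwrap_or(true)` in the diff; the strict version is one possible alternative that surfaces the failed lookup instead of widening.

```rust
/// Hypothetical stand-in for a fallible nullability check.
/// Fails when the key is not present in the schema.
fn lookup_nullable(key: &str, schema: &[(&str, bool)]) -> Result<bool, String> {
    schema
        .iter()
        .find(|(name, _)| *name == key)
        .map(|(_, is_nullable)| *is_nullable)
        .ok_or_else(|| format!("key `{}` not found in schema", key))
}

/// Lenient policy: a failed lookup is treated as "nullable", so the key
/// is kept and the filter would be widened for it.
fn nullable_keys_lenient<'a>(keys: &[&'a str], schema: &[(&str, bool)]) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|key| lookup_nullable(key, schema).unwrap_or(true))
        .collect()
}

/// Strict policy: a failed lookup is reported as an error, so an
/// out-of-sync key list and schema cannot go unnoticed.
fn nullable_keys_strict<'a>(
    keys: &[&'a str],
    schema: &[(&str, bool)],
) -> Result<Vec<&'a str>, String> {
    let mut nullable_keys = Vec::new();
    for key in keys {
        if lookup_nullable(key, schema)? {
            nullable_keys.push(*key);
        }
    }
    Ok(nullable_keys)
}
```

With a schema of `[("a", true), ("b", false)]` and keys `["a", "b", "missing"]`, the lenient version keeps both `a` and `missing`, while the strict version returns an error for `missing`. Whether an error or a conservative widening is the better choice depends on whether that state is truly unreachable in practice.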



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

