neilconway commented on code in PR #23104:
URL: https://github.com/apache/datafusion/pull/23104#discussion_r3460076957
##########
datafusion/sqllogictest/test_files/null_aware_anti_join.slt:
##########
@@ -451,3 +451,68 @@ DROP TABLE customers_test;
statement ok
DROP TABLE all_null_banned;
+
+#############
+## Test: dynamic filter pushdown must not drop the inner NULL.
Review Comment:
```suggestion
## Test: dynamic filter pushdown must not drop inner (probe-side) NULLs.
```
##########
datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs:
##########
@@ -685,12 +691,35 @@ impl SharedBuildAccumulator {
)?) as Arc<dyn PhysicalExpr>
};
- self.dynamic_filter.update(filter_expr)?;
+ self.dynamic_filter
+ .update(self.null_aware_filter(filter_expr))?;
}
}
Ok(())
}
+
+ /// Wraps a pushdown filter so a null-aware anti join keeps its probe-side
NULL rows.
+ ///
+ /// The build-side predicate drops probe rows whose key is NULL, but `NOT
IN` three-valued
+ /// logic needs that NULL to reach the join. OR-ing `probe_key IS NULL`
preserves the dynamic
+ /// filter's selectivity for non-NULL rows while letting the NULL through.
+ fn null_aware_filter(
+ &self,
+ filter_expr: Arc<dyn PhysicalExpr>,
+ ) -> Arc<dyn PhysicalExpr> {
+ if !self.null_aware {
+ return filter_expr;
+ }
+ // A null-aware anti join is validated to a single probe key.
Review Comment:
```suggestion
debug_assert_eq!(
self.on_right.len(),
1,
"null_aware anti join must have exactly one probe key"
);
```
##########
datafusion/sqllogictest/test_files/null_aware_anti_join.slt:
##########
@@ -451,3 +451,68 @@ DROP TABLE customers_test;
statement ok
DROP TABLE all_null_banned;
+
+#############
+## Test: dynamic filter pushdown must not drop the inner NULL.
+## With join dynamic filter pushdown on, the build-side filter pushed to the
probe scan would drop
+## the inner NULL, but NOT IN three-valued logic needs it to collapse the
result to zero rows. The
+## in-memory VALUES scans above never apply the pushed filter, so this case
needs a parquet scan.
+#############
+
+statement ok
+set datafusion.optimizer.enable_join_dynamic_filter_pushdown = true;
+
+# Row-level parquet filtering, so the pushed filter actually drops matching
rows instead of only
+# pruning row groups. Without this the single row group is read whole and the
NULL never gets dropped.
+statement ok
+set datafusion.execution.parquet.pushdown_filters = true;
+
+statement ok
+CREATE TABLE asa_outer(id INT) AS VALUES (1), (2), (3);
+
+statement ok
+CREATE TABLE asa_inner(eid INT) AS VALUES (2), (NULL);
+
+query I
+COPY asa_outer TO 'test_files/scratch/null_aware_anti_join/asa_outer.parquet'
STORED AS PARQUET;
+----
+3
+
+query I
+COPY asa_inner TO 'test_files/scratch/null_aware_anti_join/asa_inner.parquet'
STORED AS PARQUET;
+----
+2
+
+statement ok
+CREATE EXTERNAL TABLE asa_outer_parquet(id INT)
+STORED AS PARQUET
+LOCATION 'test_files/scratch/null_aware_anti_join/asa_outer.parquet';
+
+statement ok
+CREATE EXTERNAL TABLE asa_inner_parquet(eid INT)
+STORED AS PARQUET
+LOCATION 'test_files/scratch/null_aware_anti_join/asa_inner.parquet';
+
+# Expected: zero rows. Before the fix the pushed dynamic filter dropped the
inner NULL, so the join
+# wrongly returned id = 1 and id = 3.
+query I
+SELECT id FROM asa_outer_parquet WHERE id NOT IN (SELECT eid FROM
asa_inner_parquet) ORDER BY id;
+----
+
+statement ok
+DROP TABLE asa_outer;
+
+statement ok
+DROP TABLE asa_inner;
+
+statement ok
+DROP TABLE asa_outer_parquet;
+
+statement ok
+DROP TABLE asa_inner_parquet;
+
+statement ok
+set datafusion.execution.parquet.pushdown_filters = false;
+
+statement ok
+set datafusion.optimizer.enable_join_dynamic_filter_pushdown = true;
Review Comment:
Confusing that we explicitly set this to `true` at the top of the file and
then again to `true` at the bottom of the file. Use `RESET`?
##########
datafusion/sqllogictest/test_files/null_aware_anti_join.slt:
##########
@@ -451,3 +451,68 @@ DROP TABLE customers_test;
statement ok
DROP TABLE all_null_banned;
+
+#############
+## Test: dynamic filter pushdown must not drop the inner NULL.
Review Comment:
And adjust the comment below not to talk about "the NULL" -- multiple NULL
values are possible.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]