mdashti opened a new issue, #23103: URL: https://github.com/apache/datafusion/issues/23103
# What happens?

A null-aware anti join (`NOT IN`) returns rows when its inner side has a NULL, instead of zero.

With `enable_join_dynamic_filter_pushdown` on, `HashJoinExec` pushes a build-side filter (`key IN build_keys`) to the probe scan. That filter drops the probe's NULL row before the join's null-aware check runs, so `NOT IN` three-valued logic never collapses the result to empty.

It only shows when the probe scan applies the filter at row level, e.g. parquet with `pushdown_filters = true`. An in-memory scan that ignores the pushed filter still returns the correct empty result.

Reproduced on `54.0.0` and current `main`.

## To Reproduce

```sql
set datafusion.optimizer.enable_join_dynamic_filter_pushdown = true;
set datafusion.execution.parquet.pushdown_filters = true;

create table outer_t(id int) as values (1), (2), (3);
create table inner_t(eid int) as values (2), (null);

copy outer_t to '/tmp/outer.parquet' stored as parquet;
copy inner_t to '/tmp/inner.parquet' stored as parquet;

create external table outer_p(id int) stored as parquet location '/tmp/outer.parquet';
create external table inner_p(eid int) stored as parquet location '/tmp/inner.parquet';

select id from outer_p where id not in (select eid from inner_p) order by id;
```

Expected zero rows: a NULL in the inner set makes every `NOT IN` comparison unknown. Instead it returns:

```
+----+
| id |
+----+
| 1  |
| 3  |
+----+
```
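The failure mode described in this issue can be modelled in a few lines of Python. This is only an illustrative sketch, not DataFusion code: `null_aware_anti_join` and `dynamic_filter` are hypothetical helpers, `None` stands in for SQL NULL, and the assumption (taken from the description above) is that the outer table is the build side and the subquery is the probe side.

```python
from typing import List, Optional

Row = Optional[int]  # None models SQL NULL


def null_aware_anti_join(outer: List[Row], inner: List[Row]) -> List[Row]:
    """Model `outer NOT IN (inner)` under SQL three-valued logic."""
    if not inner:
        # NOT IN over an empty set is TRUE for every outer row.
        return list(outer)
    if any(v is None for v in inner):
        # A NULL in the inner set means no comparison can be TRUE,
        # so no outer row qualifies.
        return []
    keys = set(inner)
    # A NULL outer value compares as unknown and is dropped.
    return [v for v in outer if v is not None and v not in keys]


def dynamic_filter(probe: List[Row], build_keys: List[Row]) -> List[Row]:
    """Model a row-level `key IN build_keys` filter on the probe scan.

    For a NULL key the predicate evaluates to NULL rather than TRUE,
    so a row-level filter discards that row.
    """
    keys = {k for k in build_keys if k is not None}
    return [v for v in probe if v is not None and v in keys]


outer_rows: List[Row] = [1, 2, 3]   # build side in this model
inner_rows: List[Row] = [2, None]   # probe side in this model

# Without the pushed filter, the join sees the NULL and returns nothing.
correct = null_aware_anti_join(outer_rows, inner_rows)

# With the pushed filter, the NULL row is gone before the join runs.
filtered_inner = dynamic_filter(inner_rows, outer_rows)
buggy = null_aware_anti_join(outer_rows, filtered_inner)

print("correct:", correct)
print("filtered inner:", filtered_inner)
print("buggy:", buggy)
```

In this model the filtered probe side keeps only `2`, so the anti join yields `1` and `3`, matching the output reported in the reproduction.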
