xudong963 opened a new pull request, #21907: URL: https://github.com/apache/datafusion/pull/21907
## Which issue does this PR close?

- Related to #21637

## Rationale for this change

This is split out from review feedback on #21637.

Row groups can only be marked fully matched when all rows are guaranteed to pass the filter. For nullable predicate columns, proving `NOT(predicate)` is not enough because rows where the predicate evaluates to NULL do not pass the filter.

## What changes are included in this PR?

This PR makes the fully matched row-group proof conservative for nulls by adding `IS NULL` checks for nullable columns referenced by the predicate before evaluating the inverted pruning predicate.

It also threads `with_missing_null_counts_as_zero` through `RowGroupPruningStatistics` so normal row-group pruning keeps the existing default behavior, while fully matched proofs treat missing null counts as unknown. This reuses the existing statistics conversion path instead of adding a separate null-count conversion pass.

## Are these changes tested?

Added a regression test covering row groups with known nulls, known zero nulls, and missing null counts.

Passed locally:

- `cargo fmt --all`
- `cargo fmt --all -- --check`
- `git diff --check`
- `cargo test -p datafusion-datasource-parquet row_group_fully_matched_requires_known_non_null_predicate_columns`
- `cargo test -p datafusion-datasource-parquet row_group_pruning_predicate`
- `cargo clippy -p datafusion-datasource-parquet --tests -- -D warnings`

Also attempted locally:

- `cargo test -p datafusion-datasource-parquet row_group_filter::tests` failed because `PARQUET_TEST_DATA` / `parquet-testing/data` is not available in this checkout.
- `cargo clippy --all-targets --all-features -- -D warnings` failed when the local filesystem ran out of space writing `target/debug` metadata after checking `datafusion-datasource-parquet`.

## Are there any user-facing changes?

No API changes. This only prevents false positives in the row-group fully matched optimization.
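
To illustrate the reasoning, here is a minimal standalone sketch of the null-aware check for a filter of the form `x > threshold`. It does not use DataFusion's actual types or code paths: `ColumnStats`, `inverted_predicate_prunable`, and `fully_matched` are hypothetical names made up for this example, and the statistics model (a single nullable `i64` column whose min/max ignore nulls) is an assumption.

```rust
/// Hypothetical, simplified row-group statistics for one nullable Int64 column.
/// Illustrative only; these are not DataFusion types.
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    /// Minimum over the non-null values in the row group.
    pub min: i64,
    /// Maximum over the non-null values in the row group.
    pub max: i64,
    /// `None` models metadata that did not record a null count.
    pub null_count: Option<u64>,
}

/// Can the statistics prove that no non-null row satisfies `x <= threshold`,
/// i.e. the inverted form of the filter `x > threshold`?
pub fn inverted_predicate_prunable(stats: &ColumnStats, threshold: i64) -> bool {
    stats.min > threshold
}

/// Conservative "fully matched" check for the filter `x > threshold`.
pub fn fully_matched(stats: &ColumnStats, threshold: i64) -> bool {
    // A NULL `x` makes `x > threshold` evaluate to NULL, and such rows do not
    // pass the filter. A missing null count is treated as unknown, not zero.
    let known_non_null = stats.null_count == Some(0);
    known_non_null && inverted_predicate_prunable(stats, threshold)
}
```

With `min = 10`, `max = 20`, and `threshold = 5`, the inverted predicate is prunable in every case, but `fully_matched` returns `true` only when `null_count` is `Some(0)`. It returns `false` for `Some(3)` and for `None`, mirroring the three cases the regression test is described as covering.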
