xudong963 opened a new pull request, #21907:
URL: https://github.com/apache/datafusion/pull/21907

   ## Which issue does this PR close?
   
   - Related to #21637
   
   ## Rationale for this change
   
   This is split out from review feedback on #21637. A row group can only be 
marked fully matched when every row is guaranteed to pass the filter. For 
nullable predicate columns, proving that no row satisfies `NOT(predicate)` is 
not enough: rows where the predicate evaluates to NULL satisfy neither the 
predicate nor its negation, so they do not pass the filter.
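
   The gap can be illustrated with a small standalone sketch of SQL 
three-valued logic. It does not use any DataFusion API. The predicate `x > 5` 
and all helper names (`Tri`, `sql_not`, `passes_filter`, `gt_five`, 
`inverted_predicate_matches_nothing`, `all_rows_pass`) are hypothetical and 
chosen only for illustration.

```rust
/// SQL three-valued logic value: `None` stands for NULL.
/// (Hypothetical alias for illustration; not a DataFusion type.)
type Tri = Option<bool>;

/// SQL NOT: NOT TRUE = FALSE, NOT FALSE = TRUE, NOT NULL = NULL.
fn sql_not(v: Tri) -> Tri {
    v.map(|b| !b)
}

/// A filter keeps a row only when the predicate is exactly TRUE.
fn passes_filter(v: Tri) -> bool {
    v == Some(true)
}

/// Evaluate the example predicate `x > 5` over a nullable value.
fn gt_five(x: Option<i64>) -> Tri {
    x.map(|v| v > 5)
}

/// True when no row satisfies NOT(predicate).
fn inverted_predicate_matches_nothing(rows: &[Option<i64>]) -> bool {
    rows.iter().all(|&x| !passes_filter(sql_not(gt_five(x))))
}

/// True when every row actually passes the original predicate.
fn all_rows_pass(rows: &[Option<i64>]) -> bool {
    rows.iter().all(|&x| passes_filter(gt_five(x)))
}
```

   For rows `[10, NULL, 7]`, `inverted_predicate_matches_nothing` returns 
`true` while `all_rows_pass` returns `false`, which is the false positive 
described above.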
   
   ## What changes are included in this PR?
   
   This PR makes the fully matched row-group proof conservative for nulls by 
adding `IS NULL` checks for nullable columns referenced by the predicate before 
evaluating the inverted pruning predicate.
   
   It also threads `with_missing_null_counts_as_zero` through 
`RowGroupPruningStatistics` so normal row-group pruning keeps the existing 
default behavior, while fully matched proofs treat missing null counts as 
unknown. This reuses the existing statistics conversion path instead of adding 
a separate null-count conversion pass.
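
   As a rough, simplified sketch of the idea (not the code in this PR), the 
conservative rule could look like the following for the example predicate 
`x > 5`. `RowGroupStats`, `effective_null_count`, and 
`is_fully_matched_gt_five` are hypothetical names; only the 
`missing_as_zero` flag mirrors the `with_missing_null_counts_as_zero` option 
mentioned above, and the real implementation works on pruning predicates 
rather than hand-written comparisons.

```rust
/// Hypothetical per-row-group statistics for a single nullable column.
struct RowGroupStats {
    /// Minimum non-null value in the row group.
    min: i64,
    /// Number of NULLs, or `None` when the file did not record it.
    null_count: Option<u64>,
}

/// Resolve the null count. With `missing_as_zero` a missing count is
/// reported as zero; without it, a missing count stays unknown (`None`).
fn effective_null_count(stats: &RowGroupStats, missing_as_zero: bool) -> Option<u64> {
    match stats.null_count {
        Some(n) => Some(n),
        None if missing_as_zero => Some(0),
        None => None,
    }
}

/// Conservative "fully matched" check for `x > 5`: the column must be
/// known to contain no NULLs, and every non-null value must exceed 5.
fn is_fully_matched_gt_five(stats: &RowGroupStats) -> bool {
    // Missing null counts are treated as unknown here, so they fail the proof.
    let known_non_null = effective_null_count(stats, false) == Some(0);
    known_non_null && stats.min > 5
}
```

   With `min = 6`, this sketch reports a row group as fully matched only when 
`null_count` is `Some(0)`; both `Some(2)` and `None` fall back to evaluating 
the filter row by row.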
   
   ## Are these changes tested?
   
   Added a regression test covering row groups with known nulls, known zero 
nulls, and missing null counts.
   
   Passed locally:
   
   - `cargo fmt --all`
   - `cargo fmt --all -- --check`
   - `git diff --check`
   - `cargo test -p datafusion-datasource-parquet row_group_fully_matched_requires_known_non_null_predicate_columns`
   - `cargo test -p datafusion-datasource-parquet row_group_pruning_predicate`
   - `cargo clippy -p datafusion-datasource-parquet --tests -- -D warnings`
   
   Also attempted locally:
   
   - `cargo test -p datafusion-datasource-parquet row_group_filter::tests` 
failed because `PARQUET_TEST_DATA` / `parquet-testing/data` is not available in 
this checkout.
   - `cargo clippy --all-targets --all-features -- -D warnings` failed when the 
local filesystem ran out of space writing `target/debug` metadata after 
checking `datafusion-datasource-parquet`.
   
   ## Are there any user-facing changes?
   
   No API changes. This only prevents false positives in the row-group fully 
matched optimization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

