Druva-D opened a new pull request, #21796: URL: https://github.com/apache/datafusion/pull/21796
## Which issue does this PR close?

- Closes #21795

## Rationale for this change

`IS NULL` / `IS NOT NULL` on struct columns is blanket-rejected from Parquet row filter pushdown, forcing all leaf columns to be materialized post-scan just to check nullability.

In Parquet, definition levels encode struct nullability independently of the leaf values, so arrow-rs can reconstruct the struct's null bitmap from any single leaf. This PR exploits that to push down struct null checks while reading only one leaf column.

| Scenario | No Pushdown | With Pushdown | Speedup |
|---|---|---|---|
| `SELECT id WHERE s IS NOT NULL` | 749ms | 63ms | **11.9x** |
| `SELECT * WHERE s IS NOT NULL` | 733ms | 1040ms | 0.7x |

The speedup applies when the struct is filtered but not projected. `SELECT *` shows no benefit (0.7x in this measurement) since all leaves are read for the output anyway.

## What changes are included in this PR?

- Intercept `IS NULL(Column(struct))` / `IS NOT NULL(Column(struct))` in `PushdownChecker::f_down` before the Column node triggers the blanket struct rejection
- Resolve null checks to only the first Parquet leaf via `resolve_struct_null_check_leaves()`. Merge the first leaf path into field access paths in `build_filter_schema` to avoid schema/mask mismatch when combined with `get_field` in OR expressions
- Renamed `struct_data_structures_prevent_pushdown` → `struct_is_not_null_allows_pushdown` (assertion flipped, since struct null checks are now supported)

## Are these changes tested?

Yes: 10 unit tests (pushdown acceptance, correctness with all-null leaves, nested structs, OR expressions with combined null check + field access, NOT wrapping), plus an integration test verifying the `pushdown_rows_pruned`/`pushdown_rows_matched` metrics through the full `SessionContext` pipeline.

Tests generated with the help of [Claude Code](https://claude.com/product/claude-code)

## Are there any user-facing changes?

No. Struct null check pushdown activates automatically when `pushdown_filters` is enabled.
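
For readers unfamiliar with why a single leaf is enough, here is a minimal, standalone sketch of the idea from the rationale. It is not code from this PR or from arrow-rs. It assumes a hypothetical schema `optional group s { optional <type> a; }` with no repeated fields, where a leaf's definition level is 0 when `s` is null, 1 when `s` is present but `a` is null, and 2 when `a` is present. The names `STRUCT_DEF_LEVEL`, `struct_validity`, and `is_not_null_rows` are invented for illustration.

```rust
/// Definition level at which the struct `s` itself is defined.
/// Assumption: `optional group s { optional <type> a; }`, no repetition,
/// so leaf `a` has levels 0 (s null), 1 (s present, a null), 2 (a present).
const STRUCT_DEF_LEVEL: i16 = 1;

/// Derive the struct's validity (true = non-null) from the definition
/// levels of a single leaf column. No leaf values are needed.
fn struct_validity(def_levels: &[i16]) -> Vec<bool> {
    def_levels.iter().map(|&d| d >= STRUCT_DEF_LEVEL).collect()
}

/// Row indices that would be selected by `s IS NOT NULL`.
fn is_not_null_rows(def_levels: &[i16]) -> Vec<usize> {
    struct_validity(def_levels)
        .iter()
        .enumerate()
        .filter_map(|(row, &valid)| valid.then_some(row))
        .collect()
}
```

Under these assumptions, a row whose leaf has definition level 1 (struct present, leaf null) still counts as a non-null struct, which is the situation the "all-null leaves" tests guard against getting wrong.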
