Druva-D opened a new issue, #21795:
URL: https://github.com/apache/datafusion/issues/21795
### Is your feature request related to a problem or challenge?
Filters like `WHERE struct_col IS NOT NULL` on struct columns cannot be
pushed down into the Parquet scan. The `PushdownChecker` in `row_filter.rs`
blanket-rejects any filter that references a whole struct column, even though
`IS NOT NULL` only needs the struct's null bitmap, not any leaf data.
This forces a `FilterExec` above the scan that materializes ALL leaf columns
of the struct just to check nullability:
```
FilterExec: price@4 IS NOT NULL
  DataSourceExec: projection=[..., price, ...]  -- reads all price.* leaf columns
```
For a struct with many fields, this means reading and decoding every leaf
column even though the predicate needs no leaf values, only the struct's
null bitmap.
PRs #20822 and #20854 added pushdown support for `get_field` expressions on
struct fields (e.g., `s['value'] > 10`), but `IS NULL` / `IS NOT NULL` on the
whole struct was explicitly left unsupported. The test
`struct_data_structures_prevent_pushdown` asserts this rejection.
### Describe the solution you'd like
Allow `IS NULL` and `IS NOT NULL` on struct columns to be pushed down as
row-level filters, reading only a single leaf column instead of all leaves.
In Parquet, definition levels encode nullability at every nesting level
independently. When arrow-rs reads even one leaf column, it reconstructs the
struct's null bitmap from definition levels. `is_not_null()` then checks this
bitmap, not the leaf's data.
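To make the definition-level argument concrete, here is a minimal standalone sketch (plain Rust, not arrow-rs code). It assumes a non-repeated struct, so there is exactly one definition level per row, and a hypothetical schema `optional group price { optional int64 amount; }` where the struct is present at definition level 1 and the leaf value at level 2. The function names are made up for illustration.

```rust
/// Derive a struct's validity (true = non-null) from ONE leaf's definition
/// levels. `struct_def_level` is the level at which the struct itself is
/// present; a lower level means the struct (or an ancestor) is null.
/// Assumption: no repeated fields, so one definition level per row.
fn struct_validity(leaf_def_levels: &[i16], struct_def_level: i16) -> Vec<bool> {
    leaf_def_levels
        .iter()
        .map(|&level| level >= struct_def_level)
        .collect()
}

/// Row mask for `struct_col IS NULL`: the negation of the validity above.
/// Note that the leaf's decoded values are never consulted.
fn is_null_mask(leaf_def_levels: &[i16], struct_def_level: i16) -> Vec<bool> {
    struct_validity(leaf_def_levels, struct_def_level)
        .into_iter()
        .map(|valid| !valid)
        .collect()
}
```

With levels `[2, 1, 0, 2]` and a struct level of 1, only the third row has a null struct. The second row has a non-null struct whose leaf is null, which is why the check has to use the struct's level rather than the leaf's own validity.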
This means we can:
1. Detect `IS NULL(Column(struct))` / `IS NOT NULL(Column(struct))` in
`PushdownChecker::f_down` before the `Column` node is visited, so the
whole-struct rejection is not triggered for this pattern.
2. Project only the first leaf column of the struct in the `ProjectionMask`.
3. Let arrow-rs reconstruct the struct null bitmap from that single leaf's
definition levels.
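The steps above could be sketched roughly as follows. This is a toy model, not the actual `PushdownChecker`: `Expr`, `ColumnInfo`, and `leaves_to_project` are hypothetical stand-ins for DataFusion's physical expressions, schema lookup, and projection logic. It only illustrates matching the null check on the parent node before the bare struct column would be rejected.

```rust
/// Toy expression tree (hypothetical stand-in for physical expressions).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
    Other(Vec<Expr>), // any other expression with child expressions
}

/// Toy schema entry: which Parquet leaf indices a column covers.
struct ColumnInfo {
    name: &'static str,
    is_struct: bool,
    leaf_indices: Vec<usize>,
}

fn lookup<'a>(schema: &'a [ColumnInfo], name: &str) -> Option<&'a ColumnInfo> {
    schema.iter().find(|c| c.name == name)
}

/// Leaf indices needed to evaluate `expr`, or `None` if it cannot be pushed down.
fn leaves_to_project(expr: &Expr, schema: &[ColumnInfo]) -> Option<Vec<usize>> {
    match expr {
        // Step 1: handle the null check on the parent node, before the
        // struct `Column` child is visited and rejected.
        Expr::IsNull(inner) | Expr::IsNotNull(inner) => {
            if let Expr::Column(name) = &**inner {
                let col = lookup(schema, name)?;
                if col.is_struct {
                    // Step 2: the first leaf is enough to recover the bitmap.
                    return col.leaf_indices.first().map(|&i| vec![i]);
                }
            }
            leaves_to_project(inner, schema)
        }
        // Any other reference to a whole struct column is still rejected.
        Expr::Column(name) => {
            let col = lookup(schema, name)?;
            if col.is_struct { None } else { Some(col.leaf_indices.clone()) }
        }
        Expr::Other(children) => {
            let mut leaves = Vec::new();
            for child in children {
                for leaf in leaves_to_project(child, schema)? {
                    if !leaves.contains(&leaf) {
                        leaves.push(leaf);
                    }
                }
            }
            Some(leaves)
        }
    }
}
```

In this model, `IS NOT NULL` on a struct covering leaves 1, 2, and 3 projects only leaf 1, while any other use of the bare struct column still returns `None`. Whether the first leaf is always a safe choice (for example when it sits inside a nested list) is an open question this sketch does not address.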
### Describe alternatives you've considered
1. **Definition-level-only reads**: Read only the definition levels of one
leaf column without decoding the data values. This would be optimal (~1 bit per
row vs 4+ bytes) but requires arrow-rs API changes
(`ProjectionMask::definition_levels_only()` or similar) that don't exist today.
2. **Statistics-based pruning only**: Use Parquet row-group null_count
statistics to prune entire row groups. However, Parquet statistics are stored
only for leaf columns, not struct columns, so struct-level null_count isn't
directly available from metadata.
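As a rough illustration of why leaf statistics give only a one-sided answer, here is a small sketch. It assumes non-repeated leaves with accurate `null_count` statistics, where a null struct makes every leaf beneath it count as null. The type and function names are hypothetical.

```rust
/// What leaf-level null counts can tell us about a struct in one row group.
#[derive(Debug, PartialEq)]
enum StructNulls {
    /// Some leaf has zero nulls, so the struct is never null in this row group.
    NoneNull,
    /// Leaf nulls may come from a null struct or just from null fields.
    Unknown,
}

/// `leaf_null_counts` holds one entry per leaf; `None` means the statistic
/// is missing for that leaf.
fn struct_nulls_from_leaves(leaf_null_counts: &[Option<u64>]) -> StructNulls {
    if leaf_null_counts.iter().any(|count| *count == Some(0)) {
        StructNulls::NoneNull
    } else {
        StructNulls::Unknown
    }
}

/// A row group can be skipped for `struct_col IS NULL` only when the struct
/// is provably never null. Nothing comparable holds for `IS NOT NULL`.
fn can_prune_for_is_null(leaf_null_counts: &[Option<u64>]) -> bool {
    struct_nulls_from_leaves(leaf_null_counts) == StructNulls::NoneNull
}
```

A leaf with zero nulls proves the struct is never null, but non-zero leaf null counts say nothing about whether the struct itself was null, so this cannot replace a row-level filter.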
### Additional context
Related to the filter pushdown EPIC: #20324
Builds on: #20822 (struct field pushdown) and #20854 (leaf projection
refinement)
PR generated with [Claude Code](https://claude.com/product/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]