adriangb opened a new pull request, #22237:
URL: https://github.com/apache/datafusion/pull/22237

   ## Which issue does this PR close?
   
   - Part of #22144 (Adaptive filter pushdown), split into a reviewable stack. 
This is **PR 4 of 4** — the integration.
   
   ## Rationale for this change
   
   With the wrapper type (#22234), per-conjunct pruning stats (#22235), and the 
cost model (#22236) in place, this PR wires them into the parquet scan so 
filter placement adapts to measured selectivity and throughput instead of a 
fixed pushdown decision.
   
   ## What changes are included in this PR?
   
   - `ParquetMorselizer` carries tagged predicate conjuncts and a shared 
`SelectivityTracker`; at file open the tracker partitions conjuncts into 
row-level vs post-scan buckets, seeded by the per-conjunct row-group / 
page-index pruning rates collected for free during pruning.
   - `AdaptiveParquetStream` drives the push decoder one row group at a time, 
re-partitioning at row-group boundaries and swapping the decoder strategy (row 
filter + projection mask) when placement changes.
   - Integrates with the fully-matched run splitting from #21637: fully-matched 
runs get a no-filter decoder; needs-filter runs get the adaptive setup.
   - `HashJoinExec` wraps its pushed-down dynamic filter in 
`OptionalFilterPhysicalExpr` so the tracker may drop it when it is not 
cost-effective; join correctness is unaffected.
   - Adds config knobs: `filter_pushdown_min_bytes_per_sec`, 
`filter_collecting_byte_ratio_threshold`, `filter_confidence_z`.
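
   To make the partitioning step above concrete, here is a minimal sketch of how a tracker could split conjuncts into row-level and post-scan buckets from measured pass rates. It is not the cost model from #22236. The names `Placement`, `ConjunctStats`, `place`, `partition`, and the `max_pass_rate` cutoff are all hypothetical. The `z` parameter only loosely mirrors the idea behind `filter_confidence_z`, and the decision rule (a normal-approximation upper bound on the pass rate) is an assumption made for illustration.

```rust
/// Where a predicate conjunct is evaluated (illustrative names, not the PR's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Placement {
    /// Evaluated inside the decoder as a row filter.
    RowLevel,
    /// Evaluated on decoded batches after the scan.
    PostScan,
}

/// Running counts for one conjunct (hypothetical shape).
#[derive(Debug, Clone, Copy, Default)]
struct ConjunctStats {
    rows_seen: u64,
    rows_passed: u64,
}

impl ConjunctStats {
    /// Fold in one observation, e.g. a row group's worth of rows.
    fn record(&mut self, seen: u64, passed: u64) {
        self.rows_seen += seen;
        self.rows_passed += passed;
    }

    /// Upper confidence bound on the pass rate (normal approximation).
    /// With no observations the bound is 1.0: assume the filter removes nothing.
    fn pass_rate_upper_bound(&self, z: f64) -> f64 {
        if self.rows_seen == 0 {
            return 1.0;
        }
        let n = self.rows_seen as f64;
        let p = self.rows_passed as f64 / n;
        (p + z * (p * (1.0 - p) / n).sqrt()).min(1.0)
    }
}

/// Assumed rule: evaluate at row level only when the conjunct is
/// confidently selective; otherwise defer it to after the scan.
fn place(stats: &ConjunctStats, z: f64, max_pass_rate: f64) -> Placement {
    if stats.pass_rate_upper_bound(z) < max_pass_rate {
        Placement::RowLevel
    } else {
        Placement::PostScan
    }
}

/// Partition all conjuncts; intended to be re-run at row-group boundaries.
fn partition(stats: &[ConjunctStats], z: f64, max_pass_rate: f64) -> Vec<Placement> {
    stats.iter().map(|s| place(s, z, max_pass_rate)).collect()
}
```

   In this sketch, seeding from pruning rates would amount to calling `record` with the row counts observed during row-group and page-index pruning before the first `partition` call. Calling `partition` again after each row group lets the placement move as more rows are observed. A conjunct with only a handful of observations stays post-scan until the bound tightens.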
   
   ## Are these changes tested?
   
   Yes — parquet filter-pushdown integration tests, physical-optimizer 
filter-pushdown tests, proto round-trip, and sqllogictest coverage.
   
   ## Are there any user-facing changes?
   
   Yes. This PR adds new parquet read config knobs (documented in `configs.md`) 
and changes the parquet scan's behavior: filter placement now adapts at 
runtime instead of following a fixed pushdown decision. **Note:** this PR pins 
a custom `arrow-rs` branch for the push-decoder `StrategySwap` APIs; landing 
upstream requires those APIs in a released `arrow-rs` first.
   
   ---
   
   **Stacked PR — diff is cumulative against `main`.** Review the top commit 
*"feat: adaptive filter pushdown for the parquet scan"*; the commits below it 
are PRs #22234, #22235, #22236.
   
   Stack (review/merge in order):
   1. #22234 — OptionalFilterPhysicalExpr + proto
   2. #22235 — Per-conjunct pruning statistics
   3. #22236 — SelectivityTracker cost model
   4. **this PR** — Adaptive parquet scan integration


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

