zhuqi-lucas opened a new issue, #23216: URL: https://github.com/apache/datafusion/issues/23216
## Summary

Today page-level pruning in Parquet (`opener/mod.rs:1314` → `PagePruningPredicate::prune_plan_with_page_index_and_metrics`) runs **once at file open** with the static query predicate. #22450 added dynamic RG-level pruning at every RG boundary (`should_prune` in `push_decoder.rs:183`), but its rebuild path never re-evaluates the page-level predicate.

This issue extends #22450's "refresh at RG boundary" pattern to **also refresh the `PagePruningPredicate`**, so the page-level `RowSelection` of upcoming RGs is tightened by the latest TopK threshold.

## Current state (source-confirmed)

| Prune type | Where | Data | Dynamic? |
|---|---|---|---|
| RG-level (#22450) | `push_decoder.rs:183 should_prune` (RG boundary) | RG metadata min/max | ✅ rebuilt every RG boundary |
| **Page-level** | `opener/mod.rs:1314` (**file open only**) | page index | ❌ snapshot at file open |
| Row-level (RowFilter) | per batch | filter column values | ✅ reads latest threshold |

Gap: after #22450, RG-level is dynamic but page-level is still static. If the TopK heap tightens after file open, surviving RGs still have their initial (loose) page-level `RowSelection`: pages whose min/max no longer survive the new threshold are still fetched + decompressed + decoded for filter-col evaluation.

## Proposal

At every RG boundary (`PushDecoderStreamState::transition`):

1. `tracker.changed()`: same single atomic load #22450 uses
2. If changed: rebuild a fresh `PagePruningPredicate` from latest filter
3. Walk remaining RGs in access plan; refine each `RowSelection` via `prune_plan_with_page_index_and_metrics`
4. Apply via existing `into_builder() → with_row_groups(...) → build()`

Errors fall back to "keep current selection" (mirrors `should_prune`).

## Expected wins

Saves filter-column **IO + decompress + decode** for individual dead pages, extending #22450's "chip away Layer B residue" philosophy from RG to page granularity.
Most useful when:

- RGs are large (many pages each)
- Threshold tightens significantly mid-scan (e.g. after first few RGs fill the heap)
- Page index is enabled (prerequisite: without it, no-op)

## Prerequisites

- `datafusion.execution.parquet.enable_page_index = true`
- Filter column present in file schema
- Predicate chain contains a `DynamicFilter` (TopK source)

## Open design questions

1. **Refresh frequency**: every RG boundary, or only when `tracker.changed()` returns true?
2. **Granularity**: refresh access plan for *all* surviving RGs, or only the next one to be touched?
3. **arrow-rs API gap**: does the existing `with_row_groups(...)` path accept an updated per-RG `RowSelection`, or do we need a new arrow-rs API hook? (May overlap with arrow-rs#10158 territory.)
4. **Stretch goal · mid-RG refresh**: refresh *between* pages of the same RG, not just at RG boundary. Needs a brand-new arrow-rs "mid-RG predicate adapt" callback hook.

## Related

- #22450: RG-level dynamic prune (the foundation this extends)
- #23067: Per-RG `fully_matched` RowFilter skip
- arrow-rs#10158: `peek_next_row_group` (related rebuild surface)
- arrow-rs#9937: Page-level reverse iteration (independent but adjacent)

Part of the Sort Pushdown EPIC #23036, future direction.
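
For illustration only, the four steps in the Proposal section could be modelled roughly as below. This is a std-only toy sketch, not DataFusion or arrow-rs code: `PageStats`, `Run`, `DynamicThreshold`, `Tracker`, `RowGroupPlan`, `prune_pages` and `refresh_at_rg_boundary` are all hypothetical names, and the dynamic filter is assumed to have the shape `col < threshold` (as for an ascending TopK), so a page is dead once its minimum is no longer below the threshold. A missing page index (`pages: None`) stands in for the "keep current selection" fallback.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical per-page statistics, standing in for one page-index entry.
/// For an assumed `col < threshold` predicate only the page minimum matters.
#[derive(Debug, Clone, Copy)]
pub struct PageStats {
    pub min: i64,
    pub rows: usize,
}

/// Hypothetical run-length selection, loosely modelled on a `RowSelection`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Run {
    Select(usize),
    Skip(usize),
}

/// Hypothetical shared TopK threshold plus a generation counter.
pub struct DynamicThreshold {
    value: AtomicI64,
    generation: AtomicU64,
}

impl DynamicThreshold {
    pub fn new(value: i64) -> Self {
        Self { value: AtomicI64::new(value), generation: AtomicU64::new(0) }
    }

    /// Lower the threshold; the generation is bumped only if it really tightened.
    pub fn tighten(&self, value: i64) {
        if value < self.value.fetch_min(value, Ordering::AcqRel) {
            self.generation.fetch_add(1, Ordering::Release);
        }
    }
}

/// Hypothetical tracker: remembers the last generation this scan observed.
#[derive(Default)]
pub struct Tracker {
    seen: u64,
}

impl Tracker {
    /// Step 1: a single atomic load; true if the threshold moved since the last call.
    pub fn changed(&mut self, t: &DynamicThreshold) -> bool {
        let current = t.generation.load(Ordering::Acquire);
        let changed = current != self.seen;
        self.seen = current;
        changed
    }
}

/// One not-yet-decoded row group in the (toy) access plan.
pub struct RowGroupPlan {
    /// `None` models a missing page index: the selection is left untouched.
    pub pages: Option<Vec<PageStats>>,
    pub selection: Vec<Run>,
}

/// Steps 2-3: evaluate `col < threshold` against page minimums and
/// coalesce adjacent pages with the same outcome into one run.
pub fn prune_pages(pages: &[PageStats], threshold: i64) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for p in pages {
        let next = if p.min < threshold { Run::Select(p.rows) } else { Run::Skip(p.rows) };
        let merged = match (runs.last().copied(), next) {
            (Some(Run::Select(a)), Run::Select(b)) => Some(Run::Select(a + b)),
            (Some(Run::Skip(a)), Run::Skip(b)) => Some(Run::Skip(a + b)),
            _ => None,
        };
        match merged {
            Some(run) => {
                let last = runs.len() - 1;
                runs[last] = run;
            }
            None => runs.push(next),
        }
    }
    runs
}

/// Called at each row-group boundary. Returns true if selections were refreshed.
/// The toy replaces each selection wholesale; a real version would presumably
/// intersect the new selection with the existing one.
pub fn refresh_at_rg_boundary(
    tracker: &mut Tracker,
    threshold: &DynamicThreshold,
    remaining: &mut [RowGroupPlan],
) -> bool {
    if !tracker.changed(threshold) {
        return false; // nothing tightened: keep the current plan as-is
    }
    let current = threshold.value.load(Ordering::Acquire);
    for rg in remaining.iter_mut() {
        if let Some(pages) = &rg.pages {
            rg.selection = prune_pages(pages, current);
        }
    }
    true
}
```

In this model, calling `refresh_at_rg_boundary` when nothing has changed costs one atomic load and leaves every selection alone, which corresponds to the "only when `tracker.changed()` returns true" option in open question 1. Refreshing all remaining row groups rather than just the next one is likewise only one of the two options raised in open question 2.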
