zhuqi-lucas opened a new issue, #23067: URL: https://github.com/apache/datafusion/issues/23067
## Background Part of the [Sort Pushdown epic (#23036)](https://github.com/apache/datafusion/issues/23036) — specifically a follow-up to **[#22450](https://github.com/apache/datafusion/pull/22450)** (runtime RG-level early stop via TopK dynamic filter). #22450 closes the "dynamic mid-file" prune gap by dropping whole row groups whose stats prove they can't beat the current TopK threshold. There is one further, cleanly separable optimization on top of that mechanism, intentionally not in the merged PR because it depends on arrow-rs upstream: **For row groups that are `fully_matched`** (every row passes the user's `WHERE` predicate after static pruning — typical when the predicate is selective on metadata stats), the per-row `RowFilter` evaluation is pure overhead. We can suppress the `RowFilter` for those RGs and re-install it for partially-matched RGs at the next boundary. ## Why this is a follow-up, not part of #22450 The toggle requires asking the decoder which row group it will emit next so the toggle targets the correct RG. arrow-rs's internal page-index pruning (`ColumnIndex` path in `try_build`) can silently skip RGs that DataFusion's `rg_plan` still contains, so `rg_plan.front()` drifts off-by-N from the decoder's actual frontier. Without a way to sync, the `fully_matched` toggle decision lands on the wrong RG. The clean fix is **`ParquetPushDecoder::peek_next_row_group()`** in arrow-rs ([apache/arrow-rs#10158](https://github.com/apache/arrow-rs/pull/10158)), which exposes the decoder's actual next-RG frontier as a public API. Until that lands, this optimization can't be shipped. ## Design (already prototyped locally) Three coordinated pieces, all extensions of #22450's `PushDecoderStreamState`: 1. **Track `fully_matched: Vec<bool>` per RG** in `PreparedAccessPlan`. Populated by the existing page-index pruning path (when every page of an RG passes the static predicate, mark it `fully_matched`). 2. **Carry `fully_matched` through `rg_plan`** as `RgPlanEntry { rg_index, fully_matched }`. Pass through `reorder_by_statistics` and `reverse` so the flag stays aligned with the RG index after Inexact runtime reorder. 3. **Toggle the `RowFilter` at each RG boundary** in `PushDecoderStreamState::transition`: - `peek_next_row_group()` → align `rg_plan.front()` with the decoder's actual next RG (handles arrow-rs's silent skips). - If `rg_plan.front().fully_matched`: rebuild decoder via `into_builder().with_row_filter(empty).build()` to suppress per-row eval. - Otherwise: rebuild with the saved `RowFilterContext` to restore the filter. - Track skips with a new metric `row_filter_skipped_fully_matched`. `into_builder` preserves buffered bytes across rebuilds, so toggling the filter is free of extra IO. ## Acceptance criteria - [ ] arrow-rs#10158 (`peek_next_row_group`) is merged and released. - [ ] Pin DataFusion's `parquet` dependency to the release that includes `peek_next_row_group`. - [ ] Restore the locally-prototyped toggle code into `push_decoder.rs` (`RowFilterContext`, `filter_installed`, the toggle block guarded by `peek_next_row_group()`). - [ ] Restore the `fully_matched` tracking in `ParquetAccessPlan` / `PreparedAccessPlan` (population, reorder, reverse, strip-empty). - [ ] Register and expose `row_filter_skipped_fully_matched: Count` on `ParquetFileMetrics`. - [ ] Integration test: `fully_matched_rgs_skip_row_filter` — 4 RGs with `WHERE v >= 3 ORDER BY v DESC LIMIT 3`, assert `row_filter_skipped_fully_matched >= 1`. - [ ] SLT coverage on `EXPLAIN ANALYZE` for the new metric. - [ ] Bench check: `sort_pushdown_inexact` narrow-projection queries (Q1 / Q2) — already at −46% / −60% with #22450 — confirm no regression and ideally a further trim from the saved row-filter CPU. ## Architectural note This optimization is a workaround for the lack of a `fully_matched` concept inside arrow-rs's own decode path. Long term, a cleaner design would move the optimization **into arrow-rs**: when the decoder builds the next RG's reader and finds it fully-matched via `ColumnIndex`, it could automatically skip the per-row `RowFilter` evaluation, and DataFusion would not need to mirror state at all. If that direction is taken upstream, the `fully_matched` field on `ParquetAccessPlan`, the toggle in `PushDecoderStreamState`, and the `peek_next_row_group` sync can all be deleted from DataFusion. For now, the DataFusion-side toggle is the pragmatic path. ## References - Parent epic: #23036 - Merged PR: #22450 - arrow-rs blocker: [apache/arrow-rs#10158](https://github.com/apache/arrow-rs/pull/10158) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
