zhuqi-lucas opened a new issue, #23067:
URL: https://github.com/apache/datafusion/issues/23067

   ## Background
   
   Part of the [Sort Pushdown epic 
(#23036)](https://github.com/apache/datafusion/issues/23036) — specifically a 
follow-up to **[#22450](https://github.com/apache/datafusion/pull/22450)** 
(runtime RG-level early stop via TopK dynamic filter).
   
   #22450 closes the "dynamic mid-file" prune gap by dropping whole row groups 
whose stats prove they can't beat the current TopK threshold. There is one 
further, cleanly separable optimization on top of that mechanism, intentionally 
not in the merged PR because it depends on arrow-rs upstream:
   
   **For row groups that are `fully_matched`** (every row passes the user's 
`WHERE` predicate after static pruning — typical when the predicate is 
selective on metadata stats), the per-row `RowFilter` evaluation is pure 
overhead. We can suppress the `RowFilter` for those RGs and re-install it for 
partially-matched RGs at the next boundary.
   
   ## Why this is a follow-up, not part of #22450
   
   The toggle requires asking the decoder which row group it will emit next so 
the toggle targets the correct RG. arrow-rs's internal page-index pruning 
(`ColumnIndex` path in `try_build`) can silently skip RGs that DataFusion's 
`rg_plan` still contains, so `rg_plan.front()` drifts off-by-N from the 
decoder's actual frontier. Without a way to sync, the `fully_matched` toggle 
decision lands on the wrong RG.
   
   The clean fix is **`ParquetPushDecoder::peek_next_row_group()`** in arrow-rs 
([apache/arrow-rs#10158](https://github.com/apache/arrow-rs/pull/10158)), which 
exposes the decoder's actual next-RG frontier as a public API. Until that 
lands, this optimization can't be shipped.
   
   ## Design (already prototyped locally)
   
   Three coordinated pieces, all extensions of #22450's 
`PushDecoderStreamState`:
   
   1. **Track `fully_matched: Vec<bool>` per RG** in `PreparedAccessPlan`. 
Populated by the existing page-index pruning path (when every page of an RG 
passes the static predicate, mark it `fully_matched`).
   
   2. **Carry `fully_matched` through `rg_plan`** as `RgPlanEntry { rg_index, 
fully_matched }`. Pass through `reorder_by_statistics` and `reverse` so the 
flag stays aligned with the RG index after Inexact runtime reorder.
   
   3. **Toggle the `RowFilter` at each RG boundary** in 
`PushDecoderStreamState::transition`:
      - `peek_next_row_group()` → align `rg_plan.front()` with the decoder's 
actual next RG (handles arrow-rs's silent skips).
      - If `rg_plan.front().fully_matched`: rebuild decoder via 
`into_builder().with_row_filter(empty).build()` to suppress per-row eval.
      - Otherwise: rebuild with the saved `RowFilterContext` to restore the 
filter.
      - Track skips with a new metric `row_filter_skipped_fully_matched`.
   
   `into_builder` preserves buffered bytes across rebuilds, so toggling the 
filter is free of extra IO.
   
   ## Acceptance criteria
   
   - [ ] arrow-rs#10158 (`peek_next_row_group`) is merged and released.
   - [ ] Pin DataFusion's `parquet` dependency to the release that includes 
`peek_next_row_group`.
   - [ ] Restore the locally-prototyped toggle code into `push_decoder.rs` 
(`RowFilterContext`, `filter_installed`, the toggle block guarded by 
`peek_next_row_group()`).
   - [ ] Restore the `fully_matched` tracking in `ParquetAccessPlan` / 
`PreparedAccessPlan` (population, reorder, reverse, strip-empty).
   - [ ] Register and expose `row_filter_skipped_fully_matched: Count` on 
`ParquetFileMetrics`.
   - [ ] Integration test: `fully_matched_rgs_skip_row_filter` — 4 RGs with 
`WHERE v >= 3 ORDER BY v DESC LIMIT 3`, assert 
`row_filter_skipped_fully_matched >= 1`.
   - [ ] SLT coverage on `EXPLAIN ANALYZE` for the new metric.
   - [ ] Bench check: `sort_pushdown_inexact` narrow-projection queries (Q1 / 
Q2) — already at −46% / −60% with #22450 — confirm no regression and ideally a 
further trim from the saved row-filter CPU.
   
   ## Architectural note
   
   This optimization is a workaround for the lack of a `fully_matched` concept 
inside arrow-rs's own decode path. Long term, a cleaner design would move the 
optimization **into arrow-rs**: when the decoder builds the next RG's reader 
and finds it fully-matched via `ColumnIndex`, it could automatically skip the 
per-row `RowFilter` evaluation, and DataFusion would not need to mirror state 
at all. If that direction is taken upstream, the `fully_matched` field on 
`ParquetAccessPlan`, the toggle in `PushDecoderStreamState`, and the 
`peek_next_row_group` sync can all be deleted from DataFusion.
   
   For now, the DataFusion-side toggle is the pragmatic path.
   
   ## References
   
   - Parent epic: #23036
   - Merged PR: #22450
   - arrow-rs blocker: 
[apache/arrow-rs#10158](https://github.com/apache/arrow-rs/pull/10158)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to