zhuqi-lucas opened a new issue, #21916:
URL: https://github.com/apache/datafusion/issues/21916

   ## Background
   
   #21828 implements OFFSET pushdown for parquet queries without filters. 
Queries with WHERE clauses still use `GlobalLimitExec` for offset handling 
because, once a filter is applied, a row group's `num_rows` no longer tells us 
how many of its rows actually qualify.
   
   ## Problem
   
   For queries like `SELECT * FROM table WHERE date >= '2020-01-01' LIMIT 5 
OFFSET 1000000`, the offset is handled entirely by `GlobalLimitExec`, even when 
statistics prove that all rows in some row groups (RGs) satisfy the filter.
   
   ## Opportunity
   
   `prune_by_statistics` already marks RGs as `is_fully_matched` when column 
statistics prove ALL rows satisfy the predicate (e.g., `min(date) >= 
'2020-01-01'`). For these RGs, `num_rows` is the exact qualifying row count — 
safe to use for offset calculation.
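
To make the "fully matched" condition concrete, here is a minimal sketch for a `col >= bound` predicate. `ColumnStats`, `Match`, and `classify_ge` are hypothetical names for illustration and not DataFusion's actual pruning types. Dates are modeled as days since the Unix epoch, and the sketch assumes min/max statistics are present.

```rust
/// Hypothetical min/max statistics for one column of one row group.
/// Dates are modeled as days since the Unix epoch for simplicity.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i32,
    max: i32,
    null_count: u64,
}

/// Outcome of checking a row group against a predicate using statistics only.
#[derive(Debug, PartialEq, Eq)]
enum Match {
    /// No row can satisfy the predicate; the row group can be pruned.
    Pruned,
    /// Some rows may satisfy it; the qualifying row count is unknown.
    Partial,
    /// Every row satisfies it; `num_rows` is the exact qualifying count.
    Full,
}

/// Classify a row group against the predicate `col >= bound`.
fn classify_ge(stats: ColumnStats, bound: i32) -> Match {
    if stats.max < bound {
        Match::Pruned
    } else if stats.min >= bound && stats.null_count == 0 {
        // A NULL never satisfies `>=`, so any NULL rules out a full match.
        Match::Full
    } else {
        Match::Partial
    }
}
```

With `bound = 18262` (2020-01-01 as days since the epoch), a row group whose `min` is 18262 or later and which has no nulls classifies as `Match::Full`.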
   
   ## Proposed approach
   
   1. During `prune_by_offset`, skip leading **fully-matched** RGs whose 
cumulative rows fall within the offset (already implemented in #21828's 
`prune_by_offset` with the `has_predicate` flag)
   2. Stop at the first non-fully-matched RG (qualifying row count unknown)
   3. `GlobalLimitExec` handles the remaining offset (reduced by skipped rows)
   4. A mechanism is needed to communicate the skipped row count from the 
parquet opener back to `GlobalLimitExec` so that it can reduce its skip
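
Steps 1 to 3 can be sketched roughly as follows. This is an illustration under assumptions, not the code from #21828: `RowGroupInfo` and `skip_leading_fully_matched` are hypothetical names, and the slice is assumed to contain only the row groups that survived statistics pruning, in scan order.

```rust
/// Hypothetical per-row-group summary: the row count from parquet metadata
/// plus whether statistics proved that every row satisfies the predicate.
#[derive(Debug, Clone, Copy)]
struct RowGroupInfo {
    num_rows: usize,
    is_fully_matched: bool,
}

/// Returns `(row_groups_to_skip, remaining_offset)`.
///
/// Only leading fully-matched row groups that fit entirely inside the
/// offset are skipped. The remaining offset must still be applied
/// downstream (for example by `GlobalLimitExec`).
fn skip_leading_fully_matched(row_groups: &[RowGroupInfo], offset: usize) -> (usize, usize) {
    let mut remaining = offset;
    let mut skipped = 0;
    for rg in row_groups {
        // Stop at the first row group whose qualifying row count is
        // unknown, or that contains the offset boundary and must be read.
        if !rg.is_fully_matched || rg.num_rows > remaining {
            break;
        }
        remaining -= rg.num_rows;
        skipped += 1;
    }
    (skipped, remaining)
}
```

For three fully-matched row groups of 400,000 rows each and `offset = 1_000_000`, this returns `(2, 200_000)`: two row groups are skipped without decoding, and the third must be read because the offset boundary falls inside it.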
   
   ## Challenge
   
   The key difficulty is coordinating between parquet-level RG skipping and 
`GlobalLimitExec`'s skip counter. The optimizer sets `GlobalLimitExec(skip=N)` 
at plan time, but the actual RG-level skipping happens at runtime. Options:
   - Shared counter between opener and GlobalLimitExec
   - Dynamic adjustment of GlobalLimitExec skip based on DataSourceExec's output
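
As a rough illustration of the first option, a shared counter could look something like the sketch below. `SharedSkip` and its methods are hypothetical and do not exist in DataFusion. The sketch also assumes a single output stream in which the scan finishes its row-group skipping before the limit operator reads the counter; it does not address ordering across partitions or files.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical handle shared between the scan and the limit operator.
/// It holds the number of rows that still have to be skipped.
#[derive(Debug, Clone)]
struct SharedSkip {
    remaining: Arc<AtomicUsize>,
}

impl SharedSkip {
    fn new(skip: usize) -> Self {
        Self { remaining: Arc::new(AtomicUsize::new(skip)) }
    }

    /// Called by the scan for a fully-matched row group of `rows` rows.
    /// Returns true, and reduces the remaining skip, only if the whole
    /// row group fits inside what is left of the offset.
    fn try_consume(&self, rows: usize) -> bool {
        self.remaining
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |cur| cur.checked_sub(rows))
            .is_ok()
    }

    /// Called by the limit operator to learn how many rows it still
    /// has to drop from its input.
    fn remaining(&self) -> usize {
        self.remaining.load(Ordering::SeqCst)
    }
}
```

The plan would still be built with `skip=N`, but the limit operator would consult `remaining()` at execution time instead of the plan-time constant. Whether that fits DataFusion's execution model is exactly the open question here.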
   
   ## Related
   
   - #21828 — Single-file no-filter OFFSET pushdown (parent PR)
   - #21915 — Multi-file OFFSET pushdown
   - #19654 — Original issue for OFFSET performance


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


