Shubham Roy created HBASE-29974:
-----------------------------------

             Summary: Filter seek hints underutilized due to early circuit 
breaks in scan pipeline, causing unnecessary cell-level iteration
                 Key: HBASE-29974
                 URL: https://issues.apache.org/jira/browse/HBASE-29974
             Project: HBase
          Issue Type: Improvement
          Components: Filters, Scanners
    Affects Versions: 2.5.13, 2.6.4
            Reporter: Shubham Roy
            Assignee: Shubham Roy


h1. Summary

The filter seek-hint infrastructure (SEEK_NEXT_USING_HINT / getNextCellHint) is 
only reachable through one narrow path in the scan pipeline. Multiple earlier 
circuit breaks — time range mismatch, column mismatch, version exhaustion, and 
filterRowKey rejection — all short-circuit before the filter is consulted, 
forcing the scanner to advance one cell at a time even when the filter could 
provide a large forward jump.

h1. Background

HBase's filter API supports SEEK_NEXT_USING_HINT + getNextCellHint() to allow a 
filter to tell the scanner "jump directly to this cell, skipping everything in 
between." This is the most powerful skip primitive available. However, it is 
only reachable via one path in matchColumn:

{code:java}
// All three must pass for filterCell to be reached:
tr.compare(timestamp) == 0       // time range gate
columns.checkColumn() == INCLUDE  // column gate
columns.checkVersions() == INCLUDE* // version gate
→ filter.filterCell(cell)         // only here can SEEK_NEXT_USING_HINT be returned
{code}

Every other code path bypasses filterCell entirely.
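
The gate ordering above can be sketched in plain Java. This is a minimal illustration, not the HBase implementation: MatchCode and GateOrderSketch are simplified stand-ins, each gate's result is passed in precomputed, and the below-range case is collapsed to a single code. It only shows that the filter callback runs when all three gates pass.

```java
import java.util.function.Supplier;

// Simplified stand-in for illustration; not HBase's real MatchCode enum.
enum MatchCode { INCLUDE, SKIP, SEEK_NEXT_COL, SEEK_NEXT_USING_HINT }

final class GateOrderSketch {
  /**
   * Applies the gates in the order described above. Each gate's result is
   * passed in precomputed; filterCell is invoked only when every gate passes.
   */
  static MatchCode matchColumn(int tsCmp, MatchCode columnCheck, MatchCode versionCheck,
      Supplier<MatchCode> filterCell) {
    if (tsCmp > 0) {
      return MatchCode.SKIP;          // time range gate: filter bypassed
    }
    if (tsCmp < 0) {
      return MatchCode.SEEK_NEXT_COL; // simplification of the tracker's next-row-or-column result
    }
    if (columnCheck != MatchCode.INCLUDE) {
      return columnCheck;             // column gate: filter bypassed
    }
    if (versionCheck != MatchCode.INCLUDE) {
      return versionCheck;            // version gate: filter bypassed
    }
    return filterCell.get();          // the only place a filter hint can originate
  }
}
```

In this sketch, a cell whose timestamp is outside the range never triggers the supplier, so a filter that would have returned SEEK_NEXT_USING_HINT is never asked.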

h1. Problem

h2. Problem 1 — Uninteresting rows (filterRowKey=true)

When filterRowKey() returns true, the scanner calls nextRow(), which scans 
forward one cell at a time via storeHeap.next(MOCKED_LIST). Inside this path, 
matcher.match() is called per cell, but filterCell is only reached if a cell 
passes the time range check. For rows with no cells in the scan's time range, 
the time range gate fires for every cell, filterCell is never called, and the 
filter's hint is unreachable. The scanner pays O(cells-in-row) cost per 
rejected row rather than seeking directly to the next location.
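
To make the cost difference concrete, the following rough model compares stepping over every cell of a rejected row with a single seek. It is an assumption-laden sketch: cells are reduced to their row keys in a sorted list, and a binary search stands in for a seek (a real seek goes through the store file block index, not a binary search over cells). RejectedRowCost and its methods are hypothetical names.

```java
import java.util.List;

final class RejectedRowCost {
  /** Cell-at-a-time skip: steps over every cell of the row at index start. Returns steps taken. */
  static int stepsToLeaveRow(List<String> cellRows, int start) {
    String row = cellRows.get(start);
    int i = start;
    while (i < cellRows.size() && cellRows.get(i).equals(row)) {
      i++;
    }
    return i - start;
  }

  /** Seek model: binary search for the first cell whose row is >= hintRow. Returns probes made. */
  static int probesToSeek(List<String> cellRows, String hintRow) {
    int lo = 0;
    int hi = cellRows.size();
    int probes = 0;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      probes++;
      if (cellRows.get(mid).compareTo(hintRow) < 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return probes;
  }
}
```

Under this model, the stepping cost grows linearly with the number of cells in the rejected row, while the seek cost grows logarithmically with the total number of cells.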

h2. Problem 2 — Rows with cells outside the time range (filterRowKey=false)

Even when a row is not rejected at the row key level, cells outside the time 
range hit:

{code:java}
if (tsCmp > 0) { return MatchCode.SKIP; }                       // filter bypassed
if (tsCmp < 0) { return columns.getNextRowOrNextColumn(cell); } // filter bypassed
{code}

The filter is never consulted. If the filter could determine a better skip 
target for these cells, that capability is wasted.

h2. Problem 3 — Cells failing column or version gates (filterRowKey=false, cell 
in time range)

Even for cells within the time range, two further gates can short-circuit 
before filterCell:

# checkColumn() ≠ INCLUDE → returns column-tracker hint (SEEK_NEXT_COL) without 
consulting filter
# checkVersions() = SKIP or SEEK_NEXT_COL → returns without consulting filter

The column tracker can only suggest the next column or row. The filter may know 
a much better target (e.g., skip several columns, or skip to a completely 
different row), but is never asked.

h1. Impact

In all three cases, the scanner is forced into a cell-by-cell or row-by-row 
iteration that it could avoid if the filter's hint were consulted. Filters with 
efficient seeking logic (e.g., FuzzyRowFilter, ColumnRangeFilter, custom range 
filters) incur unnecessary I/O proportional to the number of skipped cells/rows.

h1. Root Cause

The filter hint mechanism and the scan pipeline's short-circuit mechanism are 
disconnected. Short-circuits exist for correctness and efficiency reasons (time 
range, column set, version limits), but they each bypass the filter as a side 
effect. The filter has no opportunity to provide a hint unless a cell passes 
every prior gate.


h1. Solution

Two new purpose-built API methods are introduced on Filter (with concrete 
default implementations returning null for full backward compatibility):

Filter.getHintForRejectedRow(Cell firstRowCell)
Addresses Problem 1. Called in RegionScannerImpl immediately after filterRowKey() 
returns true; filterCell() is not called for the row. Gives the filter an 
opportunity to provide a seek target so the scanner can skip the cell-by-cell 
nextRow() scan.

Contract:

* Only called after filterRowKey returns true for the same cell
* May use state derived from filterRowKey (e.g., current range pointer in 
MultiRowRangeFilter)
* Must not invoke filterCell logic — callers guarantee filterCell has not been 
called for this row
* Default returns null (falls through to existing nextRow() behavior)
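
A minimal sketch of how this contract could look, under stated assumptions: row keys are plain Strings rather than Cells, and RowFilterSketch, RangeFilterSketch, and RejectedRowScannerSketch are hypothetical stand-ins, not HBase classes. Only the method names filterRowKey and getHintForRejectedRow come from this proposal.

```java
// Hypothetical, simplified stand-ins; row keys are Strings rather than HBase Cells.
interface RowFilterSketch {
  /** Returns true when the whole row should be rejected. */
  boolean filterRowKey(String rowKey);

  /** Proposed hook; the null default preserves the existing nextRow() fallback. */
  default String getHintForRejectedRow(String firstRowCellKey) {
    return null;
  }
}

/** Accepts rows inside sorted, non-overlapping [start, stop) ranges. */
final class RangeFilterSketch implements RowFilterSketch {
  private final String[][] ranges;
  private int nextRange; // state derived in filterRowKey, reused by the hint

  RangeFilterSketch(String[][] ranges) {
    this.ranges = ranges;
  }

  @Override
  public boolean filterRowKey(String rowKey) {
    for (int i = 0; i < ranges.length; i++) {
      if (rowKey.compareTo(ranges[i][0]) < 0) {
        nextRange = i;         // row falls in the gap before range i
        return true;
      }
      if (rowKey.compareTo(ranges[i][1]) < 0) {
        return false;          // row is inside range i
      }
    }
    nextRange = ranges.length; // row is past every range
    return true;
  }

  @Override
  public String getHintForRejectedRow(String firstRowCellKey) {
    return nextRange < ranges.length ? ranges[nextRange][0] : null;
  }
}

final class RejectedRowScannerSketch {
  /** Describes what the scanner would do for one row. */
  static String handleRow(RowFilterSketch filter, String rowKey) {
    if (!filter.filterRowKey(rowKey)) {
      return "SCAN_ROW";
    }
    String hint = filter.getHintForRejectedRow(rowKey);
    return hint != null ? "SEEK:" + hint : "NEXT_ROW";
  }
}
```

A filter that does not override the hook behaves exactly as before, since the null default leads to the NEXT_ROW branch.
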

Filter.getSkipHint(Cell skippedCell)
Addresses Problems 2 and 3. Called at every structural short-circuit in 
matchColumn before filterCell is reached. Gives the filter an opportunity to 
provide a seek target for cells skipped by the time range, column, or version gate.

Contract:

* May be called for cells that have not been passed through filterCell
* Must not modify filter state (completely stateless)
* Only filters with immutable, configuration-based hint computation should 
override this
* Default returns null (falls through to existing skip/seek behavior)
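
The stateless requirement can be illustrated with a small sketch. Everything except the method name getSkipHint is an assumption made for illustration: CellKeySketch reduces a cell to a row and qualifier, and MinColumnFilterSketch is a hypothetical filter whose hint depends only on immutable configuration.

```java
// Hypothetical stand-in for a cell coordinate; not an HBase type.
final class CellKeySketch {
  final String row;
  final String qualifier;

  CellKeySketch(String row, String qualifier) {
    this.row = row;
    this.qualifier = qualifier;
  }
}

interface SkipHintFilterSketch {
  /** Proposed hook; the null default preserves the existing skip/seek behavior. */
  default CellKeySketch getSkipHint(CellKeySketch skippedCell) {
    return null;
  }
}

/** Only interested in qualifiers >= minColumn; all fields are final, so the hint is stateless. */
final class MinColumnFilterSketch implements SkipHintFilterSketch {
  private final String minColumn; // immutable configuration

  MinColumnFilterSketch(String minColumn) {
    this.minColumn = minColumn;
  }

  @Override
  public CellKeySketch getSkipHint(CellKeySketch skippedCell) {
    // Computed purely from the argument and configuration; no fields are modified.
    if (skippedCell.qualifier.compareTo(minColumn) < 0) {
      return new CellKeySketch(skippedCell.row, minColumn);
    }
    return null; // no better target known; caller keeps its own skip/seek code
  }
}
```

Because the hint is a pure function of the skipped cell and the configuration, calling it for cells that never reached filterCell cannot disturb the filter's per-row state.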





--
This message was sent by Atlassian Jira
(v8.20.10#820010)
