Hi,

On 2026-03-20 20:33:02 -0400, Andres Freund wrote:
> I'll try to read through the rest soon. But now: Food.

I'm still wondering about whether 0003 can be split apart - it's just huge and
does a lot of things at once. I find it hard to get patches of that size into
a mergeable state in one go.

What about splitting out the changes to the indexscan API, i.e. moving IOS
handling to indexam.c / heapam_handler.c into its own commit?  I feel like
that's a sizeable chunk that needs to touch a lot of places that don't need to
be touched in the commit adding batching. And it's such an architectural
improvement that it's a worthwhile thing to do on its own.

One advantage of that would be that we could merge the VISITED_PAGES_LIMIT
before the introduction of batches.


If the amgetbatch introduction didn't also replace ammarkpos/amrestrpos
with amposreset (see also the question in my previous email), I'd say I'd split
off the introduction of the batch interface from the conversion of btree to
it. But it looks like it wouldn't be entirely trivial to keep both interfaces
around, so maybe that's not worth it.



> From c1d6a412a943189372cc733b1041bff96e8c8e76 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Thu, 19 Mar 2026 12:32:34 -0400
> Subject: [PATCH v16 04/17] Add VISITED_PAGES_LIMIT to heapam
>
> ---
>  src/include/access/relscan.h             |  7 +++++++
>  src/backend/access/heap/heapam_handler.c | 22 ++++++++++++++++++++++
>  src/backend/access/index/genam.c         |  1 +
>  src/backend/utils/adt/selfuncs.c         |  9 +++++----
>  4 files changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
> index 9c1563c88..13370b1c9 100644
> --- a/src/include/access/relscan.h
> +++ b/src/include/access/relscan.h
> @@ -372,6 +372,13 @@ typedef struct IndexScanDescData
>
>       /* parallel index scan information, in shared memory */
>       struct ParallelIndexScanDescData *parallel_scan;
> +
> +     /*
> +      * Limit on distinct heap pages visited before giving up (0 = no limit).
> +      * Used by selfuncs.c to bound the cost of 
> get_actual_variable_endpoint().
> +      */
> +     uint8           xs_visited_pages_limit;

I'd maybe phrase it a bit more vaguely, e.g

    An approximate limit on the amount of work, measured in pages touched,
    imposed on the index scan. The default, 0, means no limit. Used by
    selfuncs.c to bound the cost of get_actual_variable_endpoint().

so that, in the near future (presumably 20), we can just start to also take
the number of index pages visited into account. Because I do think your
concern about leaving the number of index pages out of the limit is
correct. The easiest way to get there seems to be to just have one shared
limit.
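The distinct-pages accounting the patch does could be sketched standalone like
this (illustrative only, not patch code; the struct and names are made up to
mirror the quoted hunk's last_visited_block / n_visited_pages logic):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX

typedef struct
{
	uint32_t	last_visited_block; /* last distinct heap page seen */
	uint8_t		n_visited_pages;	/* distinct pages counted so far */
	uint8_t		pages_limit;		/* 0 means no limit */
} VisitCounter;

/*
 * Account for visiting heap page blk.  Returns false once more than
 * pages_limit distinct pages have been touched - the signal to give up.
 */
static bool
visit_page(VisitCounter *vc, uint32_t blk)
{
	if (vc->pages_limit == 0)
		return true;			/* no limit configured */

	if (blk != vc->last_visited_block)
	{
		vc->last_visited_block = blk;
		if (++vc->n_visited_pages > vc->pages_limit)
			return false;		/* limit exceeded, give up */
	}
	return true;
}
```

Note that only *distinct* consecutive pages count, so a long run of TIDs on
one heap page costs a single unit of "effort" - which is also why folding
index pages into the same counter later should be straightforward.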


> diff --git a/src/backend/access/heap/heapam_handler.c 
> b/src/backend/access/heap/heapam_handler.c
> index b8cc3d39c..1b58db199 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -824,6 +824,7 @@ heapam_index_only_getnext_slot(IndexScanDesc scan, 
> ScanDirection direction,
>                               heap_continue = false;
>
>       Assert(scan->xs_want_itup);
> +     Assert(scan->xs_visited_pages_limit == 0);
>
>       for (;;)
>       {

Why do we not need to support this for amgettuple indexes?


> @@ -906,6 +907,9 @@ heapam_index_only_batch_getnext_slot(IndexScanDesc scan, 
> ScanDirection direction
>       ItemPointer tid;
>       bool            all_visible,
>                               heap_continue = false;
> +     BlockNumber last_visited_block = InvalidBlockNumber;
> +     uint8           n_visited_pages = 0,
> +                             xs_visited_pages_limit = 
> scan->xs_visited_pages_limit;
>
>       Assert(scan->xs_want_itup);
>
> @@ -936,7 +940,25 @@ heapam_index_only_batch_getnext_slot(IndexScanDesc scan, 
> ScanDirection direction
>                               scan->instrument->nheapfetches++;
>
>                       if (!heapam_index_fetch_heap(scan, slot, 
> &heap_continue, true))
> +                     {
> +                             /*
> +                              * No visible tuple.  If caller set a 
> visited-pages limit
> +                              * (only selfuncs.c does this), count distinct 
> heap pages and
> +                              * give up once we've visited too many.
> +                              */
> +                             if (unlikely(xs_visited_pages_limit > 0))
> +                             {
> +                                     Assert(hscan->xs_blk == 
> ItemPointerGetBlockNumber(tid));
> +
> +                                     if (hscan->xs_blk != last_visited_block)
> +                                     {
> +                                             last_visited_block = 
> hscan->xs_blk;
> +                                             if (++n_visited_pages > 
> xs_visited_pages_limit)
> +                                                     return false;   /* give 
> up */
> +                                     }
> +                             }
>                               continue;               /* no visible tuple, 
> try next index entry */
> +                     }
>
>                       ExecClearTuple(slot);

I wonder if the accounting for the number of visited pages wouldn't better be
done in heapam_index_fetch_tuple(), as that already has a dedicated branch for
the cross-heap-page-boundary case.  I guess the current location was
convenient because it provides for an easy control flow for ending the scan?


> diff --git a/src/backend/access/index/genam.c 
> b/src/backend/access/index/genam.c
> index 6e87169c2..d0b68f38e 100644
> --- a/src/backend/access/index/genam.c
> +++ b/src/backend/access/index/genam.c
> @@ -126,6 +126,7 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, 
> int norderbys)
>       scan->xs_itupdesc = NULL;
>       scan->xs_hitup = NULL;
>       scan->xs_hitupdesc = NULL;
> +     scan->xs_visited_pages_limit = 0;
>
>       scan->batch_index_opaque_size = 0;
>       scan->batch_tuples_workspace = 0;
> diff --git a/src/backend/utils/adt/selfuncs.c 
> b/src/backend/utils/adt/selfuncs.c
> index 1f8b0e684..768581cf7 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -7168,11 +7168,11 @@ get_actual_variable_endpoint(Relation heapRel,
>        * pages.  When we fail for that reason, the caller will end up using
>        * whatever extremal value is recorded in pg_statistic.
>        *
> -      * XXX This can't work with the new table_index_getnext_slot interface,
> -      * which simply won't return a tuple that isn't visible to our snapshot.
> -      * table_index_getnext_slot will need some kind of callback that 
> provides
> -      * a way for the scan to give up when the costs start to get out of 
> hand.
> +      * We set xs_visited_pages_limit to tell the table AM to count distinct
> +      * heap pages visited for non-visible tuples and give up after the limit
> +      * is exceeded.
>        */
> +#define VISITED_PAGES_LIMIT 100
>       InitNonVacuumableSnapshot(SnapshotNonVacuumable,
>                                                         
> GlobalVisTestFor(heapRel));
>
> @@ -7180,6 +7180,7 @@ get_actual_variable_endpoint(Relation heapRel,
>                                                                
> &SnapshotNonVacuumable, NULL,
>                                                                1, 0);
>       Assert(index_scan->xs_want_itup);
> +     index_scan->xs_visited_pages_limit = VISITED_PAGES_LIMIT;
>       index_rescan(index_scan, scankeys, 1, NULL, 0);

Wonder if it'd be worth doing this via a wrapper like index_limit_effort()
that could check that the scan is an IOS, uses batch mode etc.
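Such a wrapper might look roughly like this - a sketch only, with stubbed
types and an invented name (index_limit_effort), since the real IndexScanDesc
and its batch-mode fields live in the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stub of the few IndexScanDesc fields the wrapper would inspect */
typedef struct
{
	bool		xs_want_itup;	/* index-only scan? */
	bool		batch_mode;		/* using the amgetbatch interface? */
	uint8_t		xs_visited_pages_limit;
} IndexScanDescStub;

/*
 * Centralize the preconditions for limiting scan effort, instead of having
 * callers like selfuncs.c poke the field directly.
 */
static void
index_limit_effort(IndexScanDescStub *scan, uint8_t pages_limit)
{
	assert(scan->xs_want_itup);	/* only supported for index-only scans */
	assert(scan->batch_mode);	/* requires the batch interface */
	scan->xs_visited_pages_limit = pages_limit;
}
```

The point being that the assertions document (and enforce) which kinds of
scans the limit is actually implemented for.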


> From c1602d1dc3c4e2dcd4afad12400ed3c28db71396 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Sat, 15 Nov 2025 14:03:58 -0500
> Subject: [PATCH v16 05/17] Add heapam index scan I/O prefetching.
>

> Testing has shown that index prefetching can make index scans much
> faster.  Large range scans that return many tuples can be as much as 35x
> faster.

I think on networked storage (unfortunately what most postgres instances run
on these days), there are easily much bigger wins :)

Perhaps worth mentioning that the wins are bigger with higher latency storage?


> In rare cases where the scan exceeds this quasi-arbitrary limit of 64,
> the read stream is temporarily paused using the read stream pausing
> mechanism added by commit 38229cb9.  Prefetching (via the read stream)
> is resumed only after the scan position advances beyond its current open
> batch and then frees and removes the batch from the scan's batch ring
> buffer.  Testing has shown that it isn't very common for scans to hold
> open more than about 10 batches to get the desired I/O prefetch
> distance.
>
> The heuristic used to decide when to begin prefetching -- delaying until
> the scan's second batch -- is imperfect.

Perhaps worth introducing the heuristic, and the need for it, in a bit more
detail?


> Selective index scans that
> access randomly-ordered heap pages will benefit from prefetching, but

s/will/would/?



> diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
> index 13370b1c9..3eb41e707 100644
> --- a/src/include/access/relscan.h
> +++ b/src/include/access/relscan.h

> +/*
> + * Compare two batch ring positions in the given scan direction.
> + *
> + * Returns negative if pos1 is behind pos2, 0 if equal, positive if pos1 is
> + * ahead of pos2.
> + */
> +static inline int
> +index_scan_pos_cmp(BatchRingItemPos *pos1, BatchRingItemPos *pos2,
> +                                ScanDirection direction)
> +{
> +     int8            batchdiff = (int8) (pos1->batch - pos2->batch);
> +
> +     if (batchdiff != 0)
> +             return batchdiff;
> +
> +     /* Same batch, compare items */
> +     if (ScanDirectionIsForward(direction))
> +             return pos1->item - pos2->item;
> +     else
> +             return pos2->item - pos1->item;
> +}

Continuing to think that a bunch of this really doesn't belong into relscan.h
but should be in a dedicated indexbatch.h or such.
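As an aside on index_scan_pos_cmp(): the int8 cast works via modular
arithmetic, giving the right ordering as long as the two uint8 batch indices
are within 127 steps of each other, even across wraparound. A standalone
demonstration (illustrative only, not patch code; the intermediate uint8 cast
is added to keep the narrowing conversion portable):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compare two wrapping uint8 batch indices.  Negative means a is behind b,
 * positive means ahead; correct whenever the true distance is < 128.
 */
static int
batch_cmp(uint8_t a, uint8_t b)
{
	return (int8_t) (uint8_t) (a - b);
}
```

This is also why prefetchPos has to be invalidated before it can fall far
enough behind to wrap, as the quoted comment in patch 05 notes further down.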



>  /*
>   * Advance position to its next item in the batch.
>   *
> diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
> index f2fd5d315..419300a6b 100644
> --- a/src/include/optimizer/cost.h
> +++ b/src/include/optimizer/cost.h
> @@ -52,6 +52,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
>  extern PGDLLIMPORT bool enable_seqscan;
>  extern PGDLLIMPORT bool enable_indexscan;
>  extern PGDLLIMPORT bool enable_indexonlyscan;
> +extern PGDLLIMPORT bool enable_indexscan_prefetch;
>  extern PGDLLIMPORT bool enable_bitmapscan;
>  extern PGDLLIMPORT bool enable_tidscan;
>  extern PGDLLIMPORT bool enable_sort;

Just reminded me: At some point soon we're going to have to adjust the costing
model to account for prefetching. I think there will be fewer cases where
bitmap scans are the correct choice than there are today...



> diff --git a/src/backend/access/heap/heapam_handler.c 
> b/src/backend/access/heap/heapam_handler.c
> index 1b58db199..28a144d1c 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -37,6 +37,7 @@
>  #include "commands/progress.h"
>  #include "executor/executor.h"
>  #include "miscadmin.h"
> +#include "optimizer/cost.h"
>  #include "pgstat.h"
>  #include "storage/bufmgr.h"
>  #include "storage/bufpage.h"

I suspect there's a good argument for moving the decision about whether to
enable prefetching to the planner instead of doing it here. But doing it here
doesn't have a lot of architectural impact for now, so we don't necessarily
have to move the point of decision making now.


> @@ -61,6 +62,9 @@ static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc 
> hscan);
>  static bool BitmapHeapScanNextBlock(TableScanDesc scan,
>                                                                       bool 
> *recheck,
>                                                                       uint64 
> *lossy_pages, uint64 *exact_pages);
> +static BlockNumber heapam_getnext_stream(ReadStream *stream,
> +                                                                             
>  void *callback_private_data,
> +                                                                             
>  void *per_buffer_data);
>

I think heapam_getnext_stream() is too generally named. For one it could just
as well be for a sequential scan, for another it sounds like it's a function
like heap_getnext() or such.  Maybe heapam_index_scan_next_block() or such?



>  /* ------------------------------------------------------------------------
> @@ -102,6 +106,17 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
>       /* Rescans should avoid an excessive number of VM lookups */
>       hscan->xs_vm_items = 1;
>
> +     /* Reset read stream direction unconditionally */
> +     hscan->xs_read_stream_dir = NoMovementScanDirection;

That's obviously the case, given the assignment, but the comment doesn't even
hint at why?


> @@ -192,7 +210,14 @@ heapam_index_fetch_tuple(struct IndexFetchTableData 
> *scan,
>               if (BufferIsValid(hscan->xs_cbuf))
>                       ReleaseBuffer(hscan->xs_cbuf);
>
> -             hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
> +             /*
> +              * When using a read stream, the stream will already know which 
> block
> +              * number comes next (though an assertion will verify a match 
> below)
> +              */
> +             if (hscan->xs_read_stream)
> +                     hscan->xs_cbuf = 
> read_stream_next_buffer(hscan->xs_read_stream, NULL);
> +             else
> +                     hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, 
> hscan->xs_blk);
>
>               /*
>                * Prune page when it is pinned for the first time

The assertion mentioned is only after the call to heap_page_prune_opt(). Which
is probably fine, but perhaps it's worth just doing it immediately after the
above if block, so that if the assertion is triggered one can quickly figure
out that it's the fault of the read stream getting confused.



> @@ -277,6 +302,30 @@ heapam_index_fetch_tuple(struct IndexFetchTableData 
> *scan,
>   * (important for inner index scans of anti-joins and semi-joins), and the
>   * need to not hold onto index leaf pages for too long.
>   *
> + * Dropping leaf page pins early
> + * -----------------------------
> + *
> + * In no event will the scan be allowed to hold onto more than one batch's
> + * leaf page pin at a time.  The primary reason for this restriction is to
> + * avoid unintended interactions with the read stream, which has its own
> + * strategy for keeping the number of pins held by the backend under control.
> + *
> + * Once we've resolved visibility for all items in a batch, we can safely 
> drop
> + * its leaf page pin.  This is safe with respect to concurrent VACUUM because
> + * index vacuuming will block on acquiring a conflicting cleanup lock on the
> + * batch's index page due to our holding a pin on that same page.  Copying 
> the
> + * relevant visibility map data into our local cache suffices to prevent 
> unsafe
> + * concurrent TID recycling: if any of these TIDs point to dead heap tuples,
> + * VACUUM cannot possibly return from ambulkdelete and mark the pointed-to
> + * heap pages as all-visible.  VACUUM _can_ do so once we release the batch's
> + * pin, but that's okay; we'll be working off of cached visibility info that
> + * indicates that the dead TIDs are NOT all-visible.

I think it'd be worth also explaining why it's safe to use the cached VM bits
in the case where the tuples on an all-visible page were only deleted after
we fetched the VM. I.e. that it's ok for this scan to think the page is
all-visible, because for the purpose of the tids it already had fetched from
the index, it was all visible, i.e. all those tids will also be visible to the
current mvcc snapshot.



> @@ -558,6 +609,35 @@ heapam_batch_getnext(IndexScanDesc scan, ScanDirection 
> direction,

I'm also worried a bit about the namespacing of these functions (were
introduced in an earlier commit, but ...). Not at all clear that a function of
that name couldn't be part of a seqscan or such.

Maybe we should prefix all of these in something like heapam_iscan_ or such?

>
>               /* Append batch to the end of ring buffer/write it to buffer 
> index */
>               index_scan_batch_append(scan, batch);
> +
> +             /*
> +              * Delay initializing stream until reading from scan's second 
> batch.
> +              * This heuristic avoids wasting cycles on starting a read 
> stream for
> +              * very selective index scans.  We can likely improve upon 
> this, but
> +              * it works well enough for now.
> +              *
> +              * Also avoid prefetching during scans where we're unable to 
> drop each
> +              * batch's buffer pin right away (non-MVCC snapshot scans).  We 
> are
> +              * not prepared to sensibly limit the total number of buffer 
> pins held
> +              * (read stream handles all pin resource management for us, and 
> knows
> +              * nothing about pins held on index pages/within batches).
> +              *
> +              * Also delay creating a read stream during index-only scans 
> that
> +              * haven't done any heap fetches yet.  We don't want to waste 
> any
> +              * cycles on allocating a read stream until we have a 
> demonstrated
> +              * need for perform heap fetches.
> +              */
> +             if (!hscan->xs_read_stream && priorBatch && scan->MVCCScan &&
> +                     hscan->xs_blk != InvalidBlockNumber &&  /* for 
> index-only scans */
> +                     enable_indexscan_prefetch)

I wonder if it'd be worth moving this into its own helper function
(e.g. heapam_index_consider_prefetching()), just to make it a bit more obvious
that that's an important decision point.  I'd also expect the logic to become
a bit more complicated in the future.


> +             {
> +                     Assert(!batchringbuf->prefetchPos.valid);
> +
> +                     hscan->xs_read_stream =
> +                             read_stream_begin_relation(READ_STREAM_DEFAULT, 
> NULL,
> +                                                                             
>    scan->heapRelation, MAIN_FORKNUM,
> +                                                                             
>    heapam_getnext_stream, scan, 0);

It's too bad efficiency wise that we can't safely use
READ_STREAM_USE_BATCHING.  For some workloads it looks like it's a decent
improvement.  We could probably do the work to make it safe (basically exiting
batchmode in heapam_getnext_stream() when we have to do index IO and then
re-enabling it afterwards), but that's clearly for later.



> +/*
> + * Handle a change in index scan direction (at the tuple granularity).
> + *
> + * Resets the read stream, since we can't rely on scanPos continuing to agree
> + * with the blocks that read stream already consumed using prefetchPos.
> + *
> + * Note: iff the scan _continues_ in this new direction, and actually steps
> + * off scanBatch to an earlier index page, heapam_batch_getnext will deal 
> with
> + * it.  But that might never happen; the scan might yet change direction 
> again
> + * (or just end before returning more items).
> + */
> +static pg_noinline void
> +heapam_dirchange_readstream_reset(IndexFetchHeapData *hscan,
> +                                                               ScanDirection 
> direction,
> +                                                               
> BatchRingBuffer *batchringbuf)
> +{
> +     /* Reset read stream state */
> +     batchringbuf->prefetchPos.valid = false;
> +     hscan->xs_paused = false;
> +     hscan->xs_read_stream_dir = direction;
> +
> +     /* Reset read stream itself */
> +     if (hscan->xs_read_stream)
> +             read_stream_reset(hscan->xs_read_stream);
> +}

Comparing this to heapam_index_fetch_reset():
Should this also invalidate hscan->xs_prefetch_block?

And the other way round, should heapam_index_fetch_reset() not reset
batchringbuf->prefetchPos.valid?



> @@ -654,21 +766,275 @@ heapam_batch_getnext_tid(IndexScanDesc scan, 
> IndexFetchHeapData *hscan,
>       {
>               IndexScanBatch headBatch = index_scan_batch(scan,
>                                                                               
>                         batchringbuf->headBatch);
> +             BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
>
>               /* free obsolescent head batch (unless it is scan's markBatch) 
> */
>               tableam_util_free_batch(scan, headBatch);
>
> +             /*
> +              * If we're about to release the batch that prefetchPos 
> currently
> +              * points to, just invalidate prefetchPos.  We'll reinitialize 
> it
> +              * using scanPos if and when heapam_getnext_stream is next 
> called. (We
> +              * must avoid confusing a prefetchPos->batch that's actually 
> before
> +              * headBatch with one that's after nextBatch due to uint8 
> overflow;
> +              * simplest way is to invalidate prefetchPos like this.)
> +              */
> +             if (prefetchPos->valid &&
> +                     prefetchPos->batch == batchringbuf->headBatch)
> +                     prefetchPos->valid = false;
> +

What causes us to reach this condition? Seems like we should take care that
the scan position doesn't overtake the prefetch position?  Does this only
happen when paused?



> +/*
> + * heapam_getnext_stream
> + *           return the next block to pass to the read stream
> + *
> + * The initial batch is always loaded by heapam_batch_getnext_tid.  We don't
> + * get called until the first read_stream_next_buffer() call, when a heap
> + * block is requested from the scan's stream for the first time.
> + *
> + * The position of the read_stream is stored in prefetchPos.  It is typical 
> for
> + * prefetchPos to consistently stay ahead of the scanPos position that's 
> used to
> + * track the next TID to be returned to the scan by heapam_batch_getnext_tid
> + * after the first time we get called.  However, that isn't a precondition.
> + * There is a strict postcondition, though: when we return we'll always leave
> + * scanPos <= prefetchPos (except in cases where we return 
> InvalidBlockNumber).
> + */
> +static BlockNumber
> +heapam_getnext_stream(ReadStream *stream, void *callback_private_data,
> +                                       void *per_buffer_data)
> +{
> +     IndexScanDesc scan = (IndexScanDesc) callback_private_data;
> +     IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> +     BatchRingBuffer *batchringbuf = &scan->batchringbuf;
> +     BatchRingItemPos *scanPos = &batchringbuf->scanPos;
> +     BatchRingItemPos *prefetchPos = &batchringbuf->prefetchPos;
> +     ScanDirection xs_read_stream_dir = hscan->xs_read_stream_dir;
> +     IndexScanBatch prefetchBatch;
> +     bool            fromScanPos = false;
> +
> +     /*
> +      * scanPos must always be valid when prefetching takes place.  There has
> +      * to be at least one batch, loaded as our scanBatch.  The scan 
> direction
> +      * must be established, too.
> +      */
> +     Assert(index_scan_batch_count(scan) > 0);
> +     Assert(scan->MVCCScan);
> +     Assert(scanPos->valid);
> +     Assert(!hscan->xs_paused);
> +     Assert(xs_read_stream_dir != NoMovementScanDirection);
> +
> +     /*
> +      * prefetchPos might not yet be valid.  It might have also fallen behind
> +      * scanPos.  Deal with both.
> +      *
> +      * If prefetchPos has not been initialized yet, that typically indicates
> +      * that this is the first call here for the entire scan.  We initialize
> +      * prefetchPos using the current scanPos, since the current scanBatch
> +      * item's TID should have its block number returned by the read stream
> +      * first.

Just for my understanding: I assume right now we'll always be at the start of
a scan batch when prefetchPos is invalid?


> It's likely that prefetchPos will get ahead of scanPos before
> +      * long, but that hasn't happened yet.
> +      *
> +      * It's also possible for prefetchPos to "fall behind" scanPos, at least
> +      * in a trivial sense: if many adjacent items are returned that contain
> +      * TIDs that point to the same heap block, scanPos can actually overtake
> +      * prefetchPos (prefetchPos can't advance until the scan actually calls
> +      * read_stream_next_buffer).  Reinitializing from scanPos is enough to
> +      * ensure that prefetchPos still fetches the next heap block that 
> scanPos
> +      * will require (prefetchPos can never fall behind "by more than one 
> group
> +      * of items that all point to the same heap block", so this is safe).

I guess this could (not necessarily should) be avoided by having
heapam_getnext_stream() advance the prefetchPos past a whole run of items
pointing to the same block before returning?  If we did that, we should never
fall behind anymore, is that right?
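That suggestion could be sketched as a small standalone helper (illustrative
only, not patch code): hand a block to the read stream only after advancing
the prefetch index past all consecutive entries that share it, so the scan
position can never consume past the prefetch position within a run:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the block at *pos and advance *pos past all consecutive entries
 * pointing to the same block, so the caller never re-requests that block.
 */
static uint32_t
next_prefetch_block(const uint32_t *blocks, size_t nblocks, size_t *pos)
{
	uint32_t	blk = blocks[*pos];

	while (*pos < nblocks && blocks[*pos] == blk)
		(*pos)++;
	return blk;
}
```

With run-collapsing like this, each distinct heap block is emitted exactly
once per run, which is the invariant that would make the "fall behind"
reinitialization path unreachable.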



> +      * Note: when heapam_batch_getnext_tid frees a batch that prefetchPos
> +      * points to, it'll invalidate prefetchPos for us.  This removes any
> +      * danger of prefetchPos.batch falling so far behind scanPos.batch that 
> it
> +      * wraps around (and appears to be ahead of scanPos instead of behind 
> it).
> +      */
> +     if (!prefetchPos->valid ||
> +             index_scan_pos_cmp(prefetchPos, scanPos, xs_read_stream_dir) < 
> 0)
> +     {
> +             hscan->xs_prefetch_block = InvalidBlockNumber;
> +             *prefetchPos = *scanPos;
> +             fromScanPos = true;
> +
> +             /*
> +              * We must avoid holding on to any batch's buffer pin for more 
> than an
> +              * instant, to avoid undesirable interactions with the scan's 
> read
> +              * stream.  batchImmediateUnguard scans always get this behavior
> +              * automatically.  Other types of scans (these are all 
> index-only
> +              * scans in practice) are made to drop their buffer pin eagerly
> +              * through a policy of always eagerly setting all the batch 
> item's
> +              * visibility info in one go.
> +              */
> +             if (scan->xs_want_itup)
> +             {
> +                     HeapBatchData *hbatch;
> +
> +                     /* Make heapam_batch_resolve_visibility release 
> resources eagerly */
> +                     hscan->xs_vm_items = scan->maxitemsbatch;
> +
> +                     /* Make sure that this new prefetchBatch has no 
> resources held */
> +                     prefetchBatch = index_scan_batch(scan, 
> prefetchPos->batch);
> +                     hbatch = heap_batch_data(prefetchBatch, scan);
> +
> +                     /* Set visibility info not set through scanBatch */
> +                     heapam_batch_resolve_visibility(scan, 
> xs_read_stream_dir,
> +                                                                             
>         prefetchBatch, hbatch,
> +                                                                             
>         prefetchPos);

Wonder if it's worth somehow asserting that the page is actually unguarded
after the call.


> +     prefetchBatch = index_scan_batch(scan, prefetchPos->batch);
> +     for (;;)
> +     {
> +             BatchMatchingItem *item;
> +             BlockNumber prefetch_block;
> +
> +             /*
> +              * We never call amgetbatch without immediately releasing the 
> batch's
> +              * index AM resources (which requires special care during 
> index-only
> +              * scans).  The read stream is sensitive to buffer shortages, 
> so we
> +              * defensively avoid anything that visibly affects the 
> per-backend
> +              * buffer limit.
> +              */

It's probably just me (not the read stream) being sensitive, but I'd say that
the read stream tries to be careful about not pinning too many buffers, and
that's harder to do reliably if there are variable numbers of pins taken
without such care...



> +             else if (!index_scan_pos_advance(xs_read_stream_dir,
> +                                                                             
>  prefetchBatch, prefetchPos))
> +             {
> +                     /*
> +                      * Ran out of items from prefetchBatch.  Try to advance 
> to the
> +                      * scan's next batch.
> +                      */
> +                     if (unlikely(index_scan_batch_full(scan)))
> +                     {
> +                             /*
> +                              * Can't advance prefetchBatch because all 
> available
> +                              * batchringbuf batch slots are currently in 
> use.
> +                              *
> +                              * Deal with this by momentarily pausing the 
> read stream.
> +                              * heapam_batch_getnext_tid will resume the 
> read stream later,
> +                              * though only after scanPos has consumed all 
> remaining items
> +                              * from scanBatch (at which point scanBatch 
> will be freed,
> +                              * making its slot available for reuse by a 
> later batch).
> +                              *
> +                              * In practice we hardly ever need to do this.  
> It would be
> +                              * possible to avoid the need to pause the read 
> stream by
> +                              * dynamically allocating slots, but that would 
> add complexity
> +                              * for no real benefit.
> +                              */
> +                             hscan->xs_paused = true;
> +                             return read_stream_pause(stream);
> +                     }

Just for my understanding: What is the easiest way to hit this?  I assume a
well correlated indexscan with small heap tuples (and thus few heap pages) is
the most obvious path?


> +                     prefetchBatch = heapam_batch_getnext(scan, 
> xs_read_stream_dir,
> +                                                                             
>                  prefetchBatch, prefetchPos);
> +                     if (!prefetchBatch)
> +                     {
> +                             /*
> +                              * Failed to load next batch, so all the 
> batches that the scan
> +                              * will ever require (barring a change in scan 
> direction) are
> +                              * now loaded
> +                              */
> +                             return InvalidBlockNumber;
> +                     }

I assume "failed" means that we reached the end of the scan? If so, "failed"
sounds a bit odd.



> +                     /* Position prefetchPos to the start of new 
> prefetchBatch */
> +                     index_scan_pos_nextbatch(xs_read_stream_dir,
> +                                                                      
> prefetchBatch, prefetchPos);
> +
> +                     if (scan->xs_want_itup)
> +                     {
> +                             HeapBatchData *hbatch = 
> heap_batch_data(prefetchBatch, scan);
> +
> +                             /* make sure we have visibility info for the 
> entire batch */
> +                             Assert(hscan->xs_vm_items == 
> scan->maxitemsbatch);
> +                             heapam_batch_resolve_visibility(scan, 
> xs_read_stream_dir,
> +                                                                             
>                 prefetchBatch, hbatch,
> +                                                                             
>                 prefetchPos);

It's a bit sad to have this code duplicated between "starting up prefetching"
and this. But it's also not entirely obvious how to do so differently.



> From 01164d1c646d441b508fb7b67fc8bea7583cb222 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <[email protected]>
> Date: Sun, 18 Jan 2026 11:32:52 -0500
> Subject: [PATCH v16 16/17] Add fake LSN support to hash index AM.
>
> This is preparation for an upcoming patch that will add the amgetbatch
> interface and switch hash over to it (from amgettuple).  We need fake
> LSNs to make it safe to apply behavior that is equivalent to nbtree's
> previous dropPin behavior that works with unlogged hash index scans.
>
> The commit that will add hashgetbatch will replace _hash_kill_items with
> a new hashkillitemsbatch routine.  This will be very similar to the
> btkillitemsbatch routine added by commit XXXXX.  In particular, it will
> use the same "did the index page's LSN change since the page was first
> read?" trick.
>
> Author: Peter Geoghegan <[email protected]>
> Discussion: 
> https://postgr.es/m/cah2-wzkehuhxyua8quc7rrn3etnxpiksjpfo8mhb+0dr2k0...@mail.gmail.com

If you reformulated the commit message to not reference the commit I see no
real reason to hold off pushing this?

Admittedly I only skimmed it, but it looks pretty much the same
as when I last looked at it (which is fine, I think it was ready to commit
then, except for the dependency on XLogGetFakeLSN, which hadn't yet been
committed)?



> @@ -372,6 +374,7 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer 
> metabuf, Buffer buf)
>       Page            page = BufferGetPage(buf);
>       HashPageOpaque pageopaque;
>       HashMetaPage metap;
> +     XLogRecPtr      recptr;
>
>       /* Scan each tuple in page to see if it is marked as LP_DEAD */
>       maxoff = PageGetMaxOffsetNumber(page);

I guess here you could move recptr up only one level, not two, but whatever.


> @@ -510,7 +512,11 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, 
> Buffer ovflbuf,
..
> +     else                                            /* 
> !RelationNeedsWAL(rel) */
> +     {
> +             recptr = XLogGetFakeLSN(rel);
> +
> +             /* Determine if wbuf is modified */
> +             if (nitups > 0)
> +                     mod_wbuf = true;
> +             else if (is_prev_bucket_same_wrt)
> +                     mod_wbuf = true;
> +     }

Any reason to not just compute mod_wbuf before the if (NeedsWal()), just like
you did for is_prim_bucket_same_wrt etc?


Greetings,

Andres Freund

