First of all, thanks for the review.

I can immediately act on most of what you've said here. I'm planning
to commit the first patch (and maybe the hash index fake LSN patch) in
the next couple of days, ahead of posting a new v17 of the patch set.
You can expect any item I reply to with "fixed" or similar to be in
that version. Other items might not be addressed in v17 -- generally
because I require more context or feedback to act.

I'm going to work through both this email and the later one from today
before posting this v17. What I've said applies equally to both
reviews/emails.

On Fri, Mar 20, 2026 at 8:33 PM Andres Freund <[email protected]> wrote:
> > There might be some concerns about the distributed costs of increased
> > code size.
>
> I'm not particularly concerned about that - I think it actually might often
> *increase* the icache hit ratio, because without this specialization every
> scan has to jump over a fair bit of code not used during that scan.

That certainly seems likely to me, since we're sensitive to icache
misses in general. Either way, this change was a win across the board.

I was mostly just acknowledging that using pg_attribute_always_inline
this aggressively is unorthodox. This will be the first code that
actually uses pg_attribute_hot, too.

> > Subject: [PATCH v16 01/17] Make IndexScanInstrumentation a pointer in 
> > executor
> >  scan nodes.
> LGTM

Cool. I will definitely push this one soon.

> > This is preparatory work for an upcoming commit that will need xs_blk
> > to manage buffer pin transfers between the scan and the executor slot.
>
> A subsequent commit adds an earlier ExecClearTuple(slot) to make the buffer
> refcount decrement cheaper (due to hitting the one-element cache in
> bufmgr.c). I wonder if it's worth pulling that into this commit? Mostly to
> make that larger commit smaller.

Is it really worth doing that without also doing the
xs_lastinblock/ExecStorePinnedBufferHeapTuple stuff? We need batches
to do the latter.

v17 will do it the other way around: it'll delay adding the
ExecClearTuple(slot) as well as the
xs_lastinblock/ExecStorePinnedBufferHeapTuple stuff until a *later*
commit. That seems more logical to me.

> The commit message doesn't mention that this affects ammarkpos/amrestrpos.

It does not. But FWIW there's a prominent "Note" about it in the SGML docs.

> Does that imply that index AMs that want to support mark/restore need to
> switch away from amgettuple?

Yes. It also means that amgetbatch-based index AMs are *required* to
support mark + restore.

For nbtree, we need the new amposreset function to safely handle this
(and to manage cross-batch changes in the scan's direction). For hash,
which lacks state to independently track the scan's progress outside
of batches (i.e. no array keys), we don't need to do anything;
mark/restore support is automatic.

Of course there is zero need for hash to support mark/restore. But it
is formally supported.

> Probably fine, but worth mentioning, I think.

I'll mention it in the commit message, then.

> I wonder if these should be in a separate header, a reasonably large number of
> files include genam.h due to systable_beginscan() etc, and those should not
> need to know about this stuff.

I'll add a new header for these.

> FWIW, Melanie is going to use xs_vmbuffer for on-access pruning soon, maybe
> worth not phrasing it as specific to IOSs as it will need to be revised 
> anyway.

Will fix.

> > + #define HEAP_BATCH_VIS_CHECKED              0x01    /* checked item in 
> > VM? */
> > + #define HEAP_BATCH_VIS_ALL_VISIBLE  0x02    /* block is known 
> > all-visible? */
>
> So we only 2 out of 8 bits. I guess it's probably not worth doing bit shift
> stuff. But still curious if you tried?

No, I didn't try. I doubt saving space is worthwhile, since we'll need
a relatively large allocation for batches used during index-only scans
regardless.

> > +static inline BTBatchData *
> > +BTBatchGetData(IndexScanBatch batch)
> >  {
> > +     return (BTBatchData *) ((char *) batch - 
> > MAXALIGN(sizeof(BTBatchData)));
> > +}
>
> I think it maybe worth threading this through a generic helper macro taking
> IndexScanDesc, IndexScanBatch, and the type of the private patch data that
> does the pointer math and also can assert that
>   iscan->batch_index_opaque_size == MAXALIGN(sizeof(BTBatchData))

Will have that in the next version.

> > +typedef struct IndexScanBatchData
> > +{
> > +     XLogRecPtr      lsn;                    /* index page's LSN */
>
> Still doubtful this should be in generic infra (together with
> indexam_util_batch_unlock()).

We're now advertising that indexam_util_batch_unlock is optional, and
that index AMs can go their own way when needed.

Basically, I just don't see much point in having 2 identical copies of
indexam_util_batch_unlock. It wouldn't be terrible to give nbtree and
hash their own copies, but it means we cannot centralize discussion of
things like the batchImmediateUnguard rules. With
indexam_util_batch_unlock, the place that handles those rules for
index AMs is right next to the place that does so for table AMs.

> > +/*
> > + * Maximum number of batches (leaf pages) we can keep in memory.  We need a
> > + * minimum of two, since we'll only consider releasing one batch when 
> > another
> > + * is read.
> > + *
> > + * The current maximum of 64 batches is somewhat of an arbitrary limit.  
> > Very
> > + * few scans ever get near to this limit in practice.
> > + */
> > +#define INDEX_SCAN_MAX_BATCHES               64
> > +#define INDEX_SCAN_CACHE_BATCHES     2
>
> Just curious: How did you end up with INDEX_SCAN_CACHE_BATCHES == 2?

Mostly trial and error. We don't want fully cached scans to repeatedly
palloc + pfree a batch. Two cache slots should suffice to reliably
reach that steady state.

It also seems to work well with uncached scans. I have custom
instrumentation that shows good "batch cache recycling hit rates" for
those.

> > +typedef struct BatchRingBuffer
> > +{

> Is nextBatch basically the tail?

Formally, nextBatch - 1 is the tail.

> The "only actually usable" bit is a bit confusing, given that
> index_scan_batch_full() is implemented by
>   (nextBatch - headBatch) == INDEX_SCAN_MAX_BATCHES
>
>
> Maybe worth having a static assert somewhere to ensure that
> INDEX_SCAN_MAX_BATCHES isn't set to a value bigger than what fits into a
> uint8?

Added a couple of static assertions for that.

> > +index_scan_batch_append(IndexScanDescData *scan, IndexScanBatch batch)

> Perhaps worth an assert ensuring that there's space in the ring buffer?

Added one.

> > +
> > +     /*
> > +      * heap blocks fetched counts (incremented by index_getnext_slot calls
> > +      * within table AMs, though only during index-only scans)
> > +      */
> > +     uint64          nheapfetches;
> >  } IndexScanInstrumentation;
>
> s/heap/table/ for anything new imo.

Usually I'd agree, but here we're using the same name as the one
presented in EXPLAIN ANALYZE. IMV this should match that. (Maybe the
EXPLAIN ANALYZE output should change, and this field name along with
it, but that's another discussion entirely.)

> > +/*
> > + * heapam_batch_resolve_visibility
> > + *           Obtain visibility information for a TID from caller's batch.
>
>
> I really dislike these pointless restatements of the function name and the
> weird indentation of the function "title" that's indented differently across
> files and in the file.  Since they're not in use in heapam_handler.c can we
> please not introduce them now?

Fixed both places that looked like that.

> > + * Note on Memory Ordering Effects
> > + * -------------------------------

> ISTM that moving this comment here is hiding it too much, I'm pretty sure
> there are other AMs using visibilitymap out there.  Perhaps we should just
> move it to the top of visibilitymap.c?

I think that the most natural place for it in visibilitymap.c is just
above visibilitymap_get_status. It'll be moved over there in the
next revision.

> > +static void
> > +heapam_batch_resolve_visibility(IndexScanDesc scan, ScanDirection 
> > direction,
> > +                                                             
> > IndexScanBatch batch, HeapBatchData *hbatch,
> > +                                                             
> > BatchRingItemPos *pos)
> > +{

> I'm surprised this is better than two loops, tbh. It makes the actual looping
> code more complicated, with higher register pressure.

I seem to recall that it was, but that was weeks ago. Let me get back
to you on this (won't be in the next version, most likely).

> > +static pg_attribute_hot IndexScanBatch
> > +heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction,
> > +                                      IndexScanBatch priorBatch, 
> > BatchRingItemPos *pos)
> > +{

> > +static pg_attribute_hot pg_attribute_always_inline ItemPointer
> > +heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan,
> > +                                              ScanDirection direction, 
> > bool *all_visible)
> > +{

> A lot of this also seems like it should be generic code.  Where it actually
> starts to be heapam specific is in heapam_batch_return_tid() - and there's two
> calls to that.  So maybe there should be generic helper for the bulk of
> heapam_batch_getnext_tid() that's called by the heapam version, which either
> returns a NULL upwards or does a single heapam_batch_return_tid().

I agree that it might very well make sense to break down
heapam_batch_getnext and heapam_batch_getnext_tid into inline
functions that allow for some amount of code reuse. But I'm not going
to have time to fix that before the next revision (too much
performance validation is required).

I'll prioritize simplifying and slimming down all of the specialized
versions of functions like heapam_index_getnext_slot (which you ask
about later). That is way too complicated right now, and inlining
helper functions seems like a promising alternative approach.

Once I'm happy with that, I'll return to this item. Not gonna be in
the next version, though.

> From what I can tell it doesn't make sense to specify both pg_attribute_hot
> pg_attribute_always_inline as the the pg_attribute_hot seems to be disregarded
> once inlined into the caller.

Fixed.

> Have you thought about implementing the different variants of this by having
> one common always-inline implementation that is called by the different
> variants, with constant arguments to decide between e.g. index_getnext_tid()
> and heapam_batch_getnext_tid()?

I'll look into it. As I said, that will be my priority for the next
revision (not making functions like heapam_batch_getnext more table AM
agnostic).

> I find it pretty hard to keep the flow below
> heapam_index_[only_][batch_]getnext_slot() straight.
>
>  - heapam_index_fetch_heap()
>  - heapam_index_fetch_tuple()
>  - heapam_batch_return_tid()
>  - heapam_batch_getnext()
>  - heapam_batch_getnext_tid()
>
> Maybe I'm just weak brained today, but from the naming I have no idea what the
> difference between e.g. heapam_index_fetch_heap(), heapam_index_fetch_tuple()
> could be and similarly the difference between heapam_batch_return_tid() and
> heapam_batch_getnext_tid() is unclear, and heapam_batch_getnext() isn't
> particularly clear either.

Yes, it's pretty unclear. You can tell that I rushed this part.

> I also suspect it'd be worth creating a new heapam.c file for this new
> code. heapam_index.c or such.

I had thought about that myself. How would that be structured, in
terms of the commits? It's pretty clear that we'd have to move some
existing heapam_handler.c code as part of this whole process (e.g.,
heapam_index_fetch_tuple).

I think that it would likely be easiest if we added an extra commit,
right after "heapam: Track heap block in IndexFetchHeapData using
xs_blk" and right before "Add interfaces that enable index
prefetching". This would be a strictly mechanical commit that moved
the code to the new file without adding anything else.

> xs_want_itup in a generic routine seems too low-level. The xs_ stuff is just
> the struct prefix that was once uniformly used for IndexScanDescData members,
> it doesn't seem to belong in a parameter.
>
> Could it just be an index_only_scan or such?

Fixed.

> > @@ -281,7 +280,23 @@ index_beginscan(Relation heapRelation,
> >        */
> >       scan->heapRelation = heapRelation;
> >       scan->xs_snapshot = snapshot;
> > +     scan->MVCCScan = IsMVCCLikeSnapshot(snapshot);
>
> Elsewhere you have an xxx comment about making sure the snapshot is
> pushed/registered - seems it should be here, not there...

That comment was copied from index_getnext_tid (it's still there in
the patch/hasn't been moved). This was based on the theory that it
made just as much sense in the new code paths, which generally won't
call index_getnext_tid. And, there are numerous other identical
comments, including in procarray.c.

I'm fine with consolidating all this, but I'd prefer it if you gave me
guidance on what to do with all of them.

> Any reason this isn't in index_beginscan_internal(), given both
> index_beginscan() and index_beginscan_parallel() need it?  I realize you'd
> need to add arguments to index_beginscan_internal(), but I don't see a problem
> with that.  Alternatively a helper for this seems like a possibility too.

Fixed, by moving much more of the initialization done by each variant
(index_beginscan, index_beginscan_bitmap, index_beginscan_parallel)
into index_beginscan_internal itself.

> > + *
> > + * This should only be called by table AM's index_getnext_slot 
> > implementation,
> > + * and only given an index AM that supports the single-tuple amgettuple
> > + * interface.
> >   * ----------------
>
> Seems like this comment should also be turned into assertions?

Fixed.

> Hm. Would it be worth using a much wider ring position to avoid this kind of
> danger?

I don't think so. It worked that way a few months ago, but I felt
using a ring buffer with frequently overflowing offsets would be
easier to test. Note that we reset all offsets to 0 to handle a change
in scan direction, so the offsets aren't entirely immutable.

I like that approach; it explicitly creates the appearance of a ring
buffer whose scan direction never changed (it's as if the new
direction had always been in use). Going to those lengths might not be
necessary, but it seems more robust. Most code (almost all code in the
table AM) doesn't have to think about changes in scan direction at
all.

The slow path loop in index_batchscan_mark_pos that you were concerned
about will be removed in the next version. It wasn't really necessary;
we can safely compare the batch pointer at the given loaded batch slot
offset. Comparing batch pointers to each other seems pretty safe and
simple to me.

> Couldn't it be that markBatch is one behind scanBatch, in which case wouldn't
> need to throw out all the prefetched batches (in the future commits that do
> prefetching)?

We can very often use the "scanBatch == markBatch" happy path, since
in practice merge joins rarely need to restore a markPos that's very
far behind scanPos. This is made more likely by some of the nbtree
work from the past 8 years, such as suffix truncation and
deduplication. Equal index tuples naturally avoid spanning multiple
index leaf pages unless it's absolutely unavoidable because there are
too many to fit on one page.

> My understanding is that it'd be rather likely that, if the
> markPos is not the current batch, it'd be in the last batch.

I agree: cases that cannot use the "scanBatch == markBatch" happy path
are very likely to restore a markBatch for the batch that was most
recently removed from the head of the ring buffer. However, I'd much
rather avoid a special case where we don't remove markBatch from the
batch ring buffer, just in case we restore the mark (which probably
won't happen, since most marks taken are never restored). I get why
you'd ask about this (naturally I thought of it myself), but I just
can't see it paying for itself.

Consider the costs. The complexity stems from it breaking the
"scanBatch == headBatch" invariant. That would affect pausing. When
the read stream callback pauses, it expects the scan code to consume
all remaining items from scanBatch and then remove scanBatch from the
ring buffer, allowing the read stream to be resumed in passing. Like a
cross-batch restore, pausing is itself a hard-to-hit code path;
testing the intersection of those 2 things would be very tricky.
(There are other reasons why I want to keep this invariant, which I
won't detail now.)

I also don't see much benefit. We're already comfortably beating the
master branch with merge joins that heavily use mark/restore, due to
improved buffer pin management (particularly with index-only scans).
The need to re-read index pages in these cases is nothing new. We
reset the read stream whenever we restore a mark (even in the happy
path), so there's relatively little risk of the distance to get out of
hand afterwards (the index pages are very likely to still be in
shared_buffers each time).

> Do we have decent test coverage of queries changing scan direction? Seems like
> something that could quite easily have bugs.

You're right to be concerned about potential bugs here. And about the
existing test coverage.

While the regression tests have limited coverage, my personal test
suite covers many variations. Some of these come from my work on the
nbtree projects included in Postgres 17 and 18. Others were derived
from a stress-testing suite developed by Tomas that generates
cursor-based queries that randomly scroll back and forth, verifying
that the patch agrees with master at every step.

Improving the core regression tests in this area would be difficult. A
test case like this must focus on one specific problematic page
transition (or a small series of related transitions). Obviously, that
depends on the data being laid out precisely, which depends on many
implementation details. And we usually need quite a bit of data to
tickle the implementation in just the right/wrong way.

> > +void
> > +tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch)
> > +{

> Any reason to not use indexam_util_batch_release() to implement the release?
> Seems nicer to not have two copies of this code.

Yes: We don't always want to use the batch cache, even when it's
available. In particular, we don't want to do that during simple point
queries (like those from pgbench SELECT), when batches are freed only
as the scan ends.

IOW, when we're called through index_batchscan_end (specifically, when
we're called through index_batchscan_reset when it is called by
index_batchscan_end) we don't want to use the cache. It seemed to make
sense to implement this in a way that didn't require any special
handling from within indexam_util_batch_release (since it's not a
concern of index AMs).

> > +#if defined(USE_VALGRIND)
> >       if (!RelationUsesLocalBuffers(rel))
> > -             VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(buf), BLCKSZ);
> > +             VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ);
> > +#endif
> >  }
>
> Is this a leftover from a larger change earlier? Because otherwise it's not
> obvious why this changed as part of this commit.

I wanted to make it as consistent as possible with the new nbtsearch.c
code, which also calls VALGRIND_MAKE_MEM_NOACCESS like this.

> > -                             info->amhasgettuple = (amroutine->amgettuple 
> > != NULL);
> > +                             info->amhasgetbatch = (amroutine->amgetbatch 
> > != NULL ||
> > +                                                                        
> > amroutine->amgettuple != NULL);
>
> Setting amhasgetbatch when not having amgetbatch seems pretty darn confusing.
> Why?

Fixed this by renaming the IndexOptInfo field that previously appeared
as 'amhasgetbatch'. It's now called 'amcanplainscan'.

> I don't think this should do two GetIndexAmRoutineByAmId() calls for the same
> AM.

Fixed.

-- 
Peter Geoghegan

