On Tue, Mar 24, 2026 at 1:27 PM Andres Freund <[email protected]> wrote:
> > But that means that it won't be triggered when we don't enter the "if
> > (hscan->xs_blk != ItemPointerGetBlockNumber(tid))" block that contains
> > all this code. Besides, it just doesn't seem possible that
> > heap_page_prune_opt would release its caller's pin.
>
> I was more concerned about read_stream_next_buffer() returning the wrong
> block, due to prefetching somehow "desynchronizing" with the scan position and
> catching that when it's clear that we just read a new block, rather than in a
> place where it could be either the continuation of a scan on the same page or
> a new page.

Then I don't follow. The existing assertions will catch that (I should
know, they've failed enough times during development).

Basically, I don't get the concern about heap_page_prune_opt releasing
its caller's pin. Even if that happened, the existing assertions would
still catch it.

> I think I had largely missed the "danger" of index only scans here. I think
> it'd be good to call that out more explicitly in these comments.

Will do.

> > > Does this only happen when paused?
> >
> > This "prefetchPos->valid = false" stuff is approximately the opposite
> > of pausing. Pausing resolves the problem of prefetchPos getting so far
> > ahead of scanPos that the batch ring buffer runs out of slots. Whereas
> > this prefetchPos invalidation code helps the read stream deal with
> > prefetchPos falling behind scanPos.
>
> Because I had somewhat missed the real cause of the problem - not calling the
> read stream code due to index only scans - I thought that somehow we could end
> up in this state due to not resuming prefetching before the scan position
> overtakes the prefetch position. But I don't think that actually happen.

Right, it can't happen. In any case the assertions we have are quite
effective at catching problems like that. For example, if we don't
resume prefetching and consume another batch, there's an assertion for
that. Actually, there's more than one: there's a direct assertion on
the scan side, and the read stream callback itself has a precondition
assertion that the read stream is not paused.
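To make the two layers concrete, here is a deliberately simplified
sketch (my own illustrative names, not the actual PostgreSQL code): a
"scan side" assertion that fires if a batch is consumed while
prefetching is paused, plus a precondition assertion in the read stream
callback itself.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, simplified model -- PrefetchState, consume_batch, and
 * stream_callback are illustrative stand-ins, not real PostgreSQL
 * symbols. The point is just the two independent assertions.
 */
typedef struct PrefetchState
{
    bool paused;        /* is prefetching currently paused? */
    int  batches_used;  /* batches consumed by the scan so far */
} PrefetchState;

/* Scan side: consuming another batch is only legal when not paused. */
static void
consume_batch(PrefetchState *state)
{
    assert(!state->paused); /* direct assertion on the scan side */
    state->batches_used++;
}

/* Read stream callback: precondition that the stream isn't paused. */
static int
stream_callback(PrefetchState *state, int next_block)
{
    assert(!state->paused); /* callback precondition assertion */
    return next_block;      /* hand back the block to read next */
}
```

Either assertion alone would catch the "didn't resume before consuming
another batch" bug; having both means the failure is reported closer to
whichever layer misbehaved.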

> > > Wonder if it's worth somehow asserting that after this the page is 
> > > actually
> > > unguarded after the call.
> >
> > We used to, but the new layering forced me to remove it. Any ideas
> > about how to add it back?
>
> Adding an "isGuarded" field to IndexScanBatchData would be the easiest
> way. That way we can make assertions about the state without knowing anything
> about the internal mechanism of how guarding is implemented.
>
> I doubt setting/clearing that field even when assertions are disabled will be
> measurable, as long as you place it alongside the other booleans where there's
> padding space available.

I've prototyped that, and it works well. It'll be in v18.
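For anyone following along, the shape of the change is roughly this
(field names other than isGuarded are invented placeholders, not the
real IndexScanBatchData definition):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the suggested layout. The new isGuarded flag
 * sits alongside the struct's other booleans, in what would otherwise
 * be padding space, so maintaining it unconditionally (even with
 * assertions disabled) shouldn't be measurable.
 */
typedef struct IndexScanBatchData
{
    int  firstItem;  /* placeholder for existing fields */
    int  lastItem;
    bool finished;   /* stand-in for the existing boolean fields */
    bool isGuarded;  /* new: is this batch's page currently guarded? */
} IndexScanBatchData;

/* Set/clear the flag wherever guarding is implemented internally. */
static void
guard_batch(IndexScanBatchData *batch)
{
    batch->isGuarded = true;
}

static void
unguard_batch(IndexScanBatchData *batch)
{
    batch->isGuarded = false;
}
```

Callers can then write something like Assert(!batch->isGuarded) after
the call in question, without knowing anything about how guarding is
implemented internally.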

> After replacing the pause with an error I found that it's surprisingly easy to
> hit on slow storage (or on fast storage if you set needed_wait=true in
> read_stream_next_buffer()).  I've not done any performance validation on
> whether that means the limit is too low.

It's been a while since I last validated performance to justify the
current maximum number of batches. I used buffered I/O for that. I'm
sure that a higher maximum with very slow storage and a very high
effective_io_concurrency will provide some benefit. But perfectly
handling that isn't essential for the first committed version of index
prefetching.

I must admit I'm unsure how to evaluate the maximum number of batches.
It can make sense to pursue diminishing returns. But up to what point,
and according to what principle?

-- 
Peter Geoghegan

