Hi,

On 2026-03-21 21:17:08 -0400, Peter Geoghegan wrote:
> On Sat, Mar 21, 2026 at 3:54 PM Andres Freund <[email protected]> wrote:
> > I'm still wondering about whether 0003 can be split apart - it's just huge and
> > does a lot of things at once. I find it hard to get patches of that size into
> > a mergeable state at once.
> >
> > What about splitting out the changes to the indexscan API, i.e. moving IOS
> > handling to indexam.c / heapam_handler.c into its own commit? I feel like
> > that's a sizeable chunk that needs to touch a lot of places that don't need to
> > be touched in the commit adding batching. And it's such an architectural
> > improvement that it's a worthwhile thing to do on its own.
>
> I'll try to get that done for v18 (the version *after* the next
> version I'll post, not the next version/v17). I want to commit the 2
> patches that are now ready ahead of producing a v17. I don't want to
> wait for a solution to this unrelated problem (nor do I want to leave
> CFTester unhappy).
Makes sense and great!


> > Just reminded me: At some point soon we're going to have to adjust the costing
> > model to account for prefetching. I think there will be fewer cases where
> > bitmap scans are the correct choice than there are today...
>
> I think so too.
>
> Obviously, the main benefit here is significantly improved
> performance. But ISTM that there's a second order benefit: the
> uncertainty about how much slower an index scan will be in the worst
> case is likely much lower. This is particularly true of index-only
> scans that require at least some heap fetches; it's difficult to
> predict how many a given scan will perform. While it isn't all that
> bad when the heap pages are in shared_buffers, it's much much slower
> when they aren't.
>
> The optimizer has only a very basic understanding of how much data is
> likely cached in shared_buffers.

Yea, it's a surprisingly naive model. I'd argue it doesn't actually model
that anything is initially in s_b, just whether repeated accesses within one
query are likely to still be in s_b. I do think we'll eventually have to do a
*bit* better than that.


> It's very hard to improve that model because it relies so much on things
> that can change quickly. Making the true cost profile of index scans simpler
> and more predictable might make the optimizer's cost model much more
> accurate (perhaps with a little work on the cost function side). It's just
> easier to fit a cost function to that profile because being wrong (about IOS
> heap fetches or correlation) matters significantly less.

There are two things that I think we will eventually have to take into
account:

1) The amount of synchronous or effectively synchronous IO:

   a) Once index prefetching is in, the cost of random table IO in a large
      ordered index scan is drastically reduced, even if the index and table
      are not well correlated. We probably need to lower the costing to take
      that into account.
   b) However, if you have a badly correlated index nestloop scan, where
      there's only a single tuple on the inner side, there's nothing
      prefetching can do, you'll have the cost of a synchronous IO every
      time. For this we probably need to *increase* the cost.

   c) Similarly, if an index is so large that it's unlikely to be in s_b, we
      probably need to better account for the synchronous random index IO.
      That's probably not quite as commonly an issue as the two above, but
      I've seen index IO cost dramatically undercosted in such scenarios.

2) The expected sizes of IOs

   It turns out to matter a fair bit for performance, particularly if data
   is in the kernel page cache vs needing to actually do IO, whether the
   average IO size is 8kB or 128kB. A large portion of the CPU cost of an IO
   is constant, independent of the size of the IO. I think we'll have a hard
   time avoiding wrong plan choices if we don't take that into account when
   costing the effect of IO.


> > I suspect there's a good argument for moving the decision about whether to
> > enable prefetching to the planner instead of doing it here. But doing it here
> > doesn't have a lot of architectural impact for now, so we don't necessarily
> > have to move the point of decision making now.
>
> I think that it makes sense for the optimizer to give advice without
> being prescriptive. Basically something a bit like the tuples_needed
> patch, that table AMs are free to ignore when they find the hint to be
> wrong.

Agreed. Clearly the table AM may know a bit more about when it makes sense to
do prefetching or not IF it knows how many rows are expected to be fetched.

I suspect this means that the planner should ask the table AM for information
about expected IO costs for an index scan, including the effect of
prefetching, at plan time. Just choosing whether to enable prefetching at
execution time (or at createplan time) can't lead to a better path being
chosen.
Although I have to admit that I am concerned that doing any of this well in
the presence of things like anti-joins or LIMITs seems nontrivial, given that
we don't know anything at those lower levels about only a few tuples ending
up getting fetched.


> > > @@ -192,7 +210,14 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
> > >  		if (BufferIsValid(hscan->xs_cbuf))
> > >  			ReleaseBuffer(hscan->xs_cbuf);
> > >
> > > -		hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
> > > +		/*
> > > +		 * When using a read stream, the stream will already know which block
> > > +		 * number comes next (though an assertion will verify a match below)
> > > +		 */
> > > +		if (hscan->xs_read_stream)
> > > +			hscan->xs_cbuf = read_stream_next_buffer(hscan->xs_read_stream, NULL);
> > > +		else
> > > +			hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
> > >
> > >  		/*
> > >  		 * Prune page when it is pinned for the first time
> >
> > The assertion mentioned is only after the call to heap_page_prune_opt(). Which
> > is probably fine, but perhaps it's worth just doing it immediately after the
> > above if block, so if the assertion is triggered one can figure out quickly
> > it's the fault of the read stream getting confused.
>
> But that means that it won't be triggered when we don't enter the "if
> (hscan->xs_blk != ItemPointerGetBlockNumber(tid))" block that contains
> all this code. Besides, it just doesn't seem possible that
> heap_page_prune_opt would release its caller's pin.

I was more concerned about read_stream_next_buffer() returning the wrong
block, due to prefetching somehow "desynchronizing" with the scan position,
and catching that when it's clear that we just read a new block, rather than
in a place where it could be either the continuation of a scan on the same
page or a new page.
> > > +		/*
> > > +		 * If we're about to release the batch that prefetchPos currently
> > > +		 * points to, just invalidate prefetchPos. We'll reinitialize it
> > > +		 * using scanPos if and when heapam_getnext_stream is next called. (We
> > > +		 * must avoid confusing a prefetchPos->batch that's actually before
> > > +		 * headBatch with one that's after nextBatch due to uint8 overflow;
> > > +		 * simplest way is to invalidate prefetchPos like this.)
> > > +		 */
> > > +		if (prefetchPos->valid &&
> > > +			prefetchPos->batch == batchringbuf->headBatch)
> > > +			prefetchPos->valid = false;
> > > +
> >
> > What causes us to reach this condition?
>
> As a general rule, prefetchPos can only advance in the read stream
> callback. It is mostly up to the read stream callback to notice when
> prefetchPos has fallen behind so it can reinitialize from scanPos. But
> we need this extra check to make its approach robust against
> prefetchPos->batch getting so far behind that uint8 wrap makes it look
> like it's ahead instead of behind.
>
> Importantly, while prefetchPos can fall behind scanPos, it's only
> possible in a limited, trivial sense: it can only fall behind for as
> long as we go without calling the read stream callback. Once the read
> stream callback is called, we can rely on its postcondition: when it
> returns, prefetchPos must be >= scanPos. So, in effect, everything
> behaves exactly as it would if it really was impossible for
> prefetchPos to be < scanPos for even a picosecond (we don't want to
> have to check prefetchPos needlessly in hot code paths).
>
> > Seems like we should take care that
> > the scan position doesn't overtake the prefetch position?
>
> That's precisely what we're doing here. We're just not checking
> prefetchPos on every item returned (we only do so when advancing to
> the next batch) to save cycles.

I think I had largely missed the "danger" of index only scans here.
I think it'd be good to call that out more explicitly in these comments.


> > Does this only happen when paused?
>
> This "prefetchPos->valid = false" stuff is approximately the opposite
> of pausing. Pausing resolves the problem of prefetchPos getting so far
> ahead of scanPos that the batch ring buffer runs out of slots. Whereas
> this prefetchPos invalidation code helps the read stream deal with
> prefetchPos falling behind scanPos.

Because I had somewhat missed the real cause of the problem - not calling the
read stream code due to index only scans - I thought that somehow we could
end up in this state due to not resuming prefetching before the scan position
overtakes the prefetch position. But I don't think that can actually happen.


> > Wonder if it's worth somehow asserting that after this the page is actually
> > unguarded after the call.
>
> We used to, but the new layering forced me to remove it. Any ideas
> about how to add it back?

Adding an "isGuarded" field to IndexScanBatchData would be the easiest way.
That way we can make assertions about the state without knowing anything
about the internal mechanism of how guarding is implemented. I doubt
setting/clearing that field even when assertions are disabled will be
measurable, as long as you place it alongside the other booleans where
there's padding space available.


> > > +		hscan->xs_paused = true;
> > > +		return read_stream_pause(stream);
> > > +	}
> >
> > Just for my understanding: What is the easiest way to hit this? I assume a
> > well correlated indexscan with small heap tuples (and thus few heap pages) is
> > the most obvious path?
>
> Depends on what you mean. If you just want to test it quickly, then
> you can easily do so by temporarily reducing INDEX_SCAN_MAX_BATCHES to
> 2 (the minimum usable value). If you meant to ask the most likely way
> this will happen in the real world, I can think of a couple of things.
> Possibly, the easiest way is to perform an index-only scan that
> requires only a few heap fetches. These fetches can be spaced out such
> that we have a hard time maintaining an adequate prefetch distance,
> and have to read many leaf pages just to do one more fetch. (In
> general, these kinds of index-only scans might be the best way to
> create extreme conditions in the read stream callback.)

After replacing the pause with an error I found that it's surprisingly easy
to hit on slow storage (or on fast storage if you set needed_wait=true in
read_stream_next_buffer()). I've not done any performance validation on
whether that means the limit is too low.

Greetings,

Andres Freund
