Hi,

On 2026-03-24 21:34:51 -0400, Peter Geoghegan wrote:
> On Tue, Mar 24, 2026 at 1:27 PM Andres Freund <[email protected]> wrote:
> > > But that means that it won't be triggered when we don't enter the "if
> > > (hscan->xs_blk != ItemPointerGetBlockNumber(tid))" block that contains
> > > all this code. Besides, it just doesn't seem possible that
> > > heap_page_prune_opt would release its caller's pin.
> >
> > I was more concerned about read_stream_next_buffer() returning the wrong
> > block, due to prefetching somehow "desynchronizing" with the scan position
> > and catching that when it's clear that we just read a new block, rather
> > than in a place where it could be either the continuation of a scan on the
> > same page or a new page.
>
> Then I don't follow. The existing assertions will catch that (I should
> know, they've failed enough times during development).
>
> Basically, I don't get the concern about heap_page_prune_opt releasing
> its caller's pin. Even if that happened, the existing assertions would
> still catch it.
My point wasn't that heap_page_prune_opt() would release the pin or such, but
that an assertion failure in the "new block" case will tell you that it was
definitely prefetching that resulted in you looking at the wrong block, rather
than some state confusion leading to continuing on the last page when it
wouldn't be right. An assertion that's after the if (changed block) doesn't
tell you which of those two went wrong.

But it really doesn't matter enough to continue discussing :)

> > After replacing the pause with an error I found that it's surprisingly
> > easy to hit on slow storage (or on fast storage if you set
> > needed_wait=true in read_stream_next_buffer()). I've not done any
> > performance validation on whether that means the limit is too low.
>
> It's been a while since I last validated performance to justify the
> current maximum number of batches. I used buffered I/O for that. I'm
> sure that a higher maximum with very slow storage and a very high
> effective_io_concurrency will provide some benefit. But perfectly
> handling that isn't essential for the first committed version of index
> prefetching.

Agreed.

> I must admit I'm unsure how to evaluate the maximum number of batches.
> It can make sense to pursue diminishing returns. But up to what point,
> and according to what principle?

I think the theoretical amount of required IO concurrency can be calculated
based on the storage latency and IOPS. IIRC it is

  iops_qd1 = 1000 / latency_ms
  queue_depth = IOPS / iops_qd1
              = IOPS / (1000 / latency_ms)

Of course that's overly simplistic, as typically the latency increases the
more IO you issue, increasing the required QD to actually fully utilize the
available IOPS. But it's a good approximation.

IIRC the upper end of "SSD like" cloud storage latency is around 3.5ms and
tends to top out around 20k IOPS. Those are unfortunately volume limits; on
larger instances you can stripe multiple disks to get higher IOPS.
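The formula above is essentially Little's law applied to an IO queue. As a
quick sanity check, here is a minimal sketch in Python (the function name and
the example numbers are just for illustration; the 3.5ms / 20k IOPS figures
are the rough cloud-SSD estimates from this mail, not measurements):

```python
# Sketch of the queue-depth formula above: at queue depth 1 you get
# 1000 / latency_ms IOPS, so reaching a target IOPS rate requires
# queue_depth = target_iops / (1000 / latency_ms) concurrent IOs.

def required_queue_depth(target_iops, latency_ms):
    """Queue depth needed to sustain target_iops at the given per-IO latency."""
    iops_qd1 = 1000.0 / latency_ms       # IOPS achievable at queue depth 1
    return target_iops / iops_qd1        # == target_iops * latency_ms / 1000

# Rough "cheap cloud SSD" tier: ~3.5ms latency, ~20k IOPS volume limit
print(required_queue_depth(20_000, 3.5))   # -> 70.0
```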
Faster tiers are ~0.25ms and capped at ~250k-400k IOPS. I think those are
typically instance limits, but I'm not sure about that.

So on the upper end of the cheaper tiers you need a queue depth of about 70,
and on the upper end of the (very expensive) faster tiers you need about 100.
That's to utilize all IOPS of course, so if you have multiple processes doing
IO...

For local NVMe SSDs latencies are around 8-30us and IOPS top out at about
1.5M (although the latter is extremely hard to sustain, it's definitely only
possible with DIO etc). That leaves you with QDs < 50 or so.

So, to be able to fully utilize current hardware with one query, we need to
be able to reach queue depths in the low hundreds, in the case of striped
cheap cloud SSDs.

That's when a backend *just* does IO, nothing else. Something like an index
scan will have its own limit to how much it can process in a second. If we
can only do 100k IOPS while searching the index, fetching the heap tuples and
processing them, we don't need to support the queue depths required to do 1M
IOPS within one backend.

That's something that can presumably be quite easily experimentally
ballparked: A fully cached, completely uncorrelated index scan seems to be
able to do about 1.5M page fetches per second on my ~6 year old server CPU
with turbo boost disabled, when never looking at the results (i.e. using
OFFSET) or immediately filtering away the rows. So I'd guess the limit on
newer CPUs, in SKUs optimized for clock speed and with boost enabled, is
north of 2.5M pages/sec, higher than I'd have thought! That's without doing
any IO though. With correlated scans the limit is much lower, maybe 150k,
just because there are so many more tuples per page (and processing them
trivially becomes the bottleneck).

So, to actually utilize the full IO capability, we need to allow for enough
batches to keep a few hundred IOs in flight at the very extreme end. I'd
assume you have a much better idea of how many batches that translates to?
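The queue-depth figures quoted above fall straight out of the formula when you
plug in the latency/IOPS estimates; a quick check (tier names and numbers are
the rough estimates from this mail, not benchmark results):

```python
# QD = IOPS * latency_ms / 1000, applied to the storage tiers discussed above.
# All latency/IOPS figures are the mail's rough estimates, not measurements.
tiers = [
    ("cheap cloud SSD", 3.5, 20_000),     # ~3.5ms, ~20k IOPS volume limit
    ("fast cloud SSD", 0.25, 400_000),    # ~0.25ms, upper end of ~250k-400k
    ("local NVMe", 0.03, 1_500_000),      # ~30us, ~1.5M IOPS
]

results = {name: iops * latency_ms / 1000 for name, latency_ms, iops in tiers}
for name, qd in results.items():
    print(f"{name}: QD ~ {qd:.0f}")
# prints QD ~ 70, ~ 100, and ~ 45 respectively
```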
Just testing a read stream of random 4kB IO I can fully saturate all the SSDs
I have, up to ~700k IOPS of random IO. IIRC I tried this with a few striped
SSDs in the past and got a bit higher than that.

Greetings,

Andres Freund
