On Thu, Mar 26, 2026 at 5:47 PM Andres Freund <[email protected]> wrote:
> > I must admit I'm unsure how to evaluate the maximum number of batches.
> > It can make sense to pursue diminishing returns. But up to what point,
> > and according to what principle?
>
> I think the theoretical amount of required IO concurrency can be calculated
> based on the storage latency and IOPS. IIRC it is
>
> iops_qd1 = (1000 / latency_ms)
> queue_depth = IOPS / iops_qd1
> queue_depth = IOPS / (1000 / latency_ms)
> So, to be able to fully utilize current hardware with one query, we need to
> be able to reach queue depth in the low hundreds, in the case of striped
> cheap cloud SSDs. That's when a backend *just* does IO, nothing else.

That sounds like a useful starting point. But empirically, as far as I can
tell, the relationship between query latency and how close you are to fully
saturating I/O is nothing like linear. Maybe it's an S-curve? A fully
I/O-bound query isn't remotely close to twice as fast as the same query
restricted to half the number of batches (half the number required to reach
that saturation point). OTOH, using more batches than strictly necessary
usually isn't much of a problem. So I don't think that we can rely on a
precise formula, even if we're willing to make fixed assumptions about data
layout (which we're not).

If there were a good enough reason for index prefetching to use an unbounded
number of batches, we could surely figure out a way to support that
requirement. It'd be messy, and relatively hard to test. And I'd worry a bit
about there being zero backstop for index-only scans. But it can be done. If
we did things that way (which doesn't seem like a good idea right now), we
wouldn't have to model I/O saturation at all. Which tbh makes me wonder if
that kind of modelling has much practical use either way.

> Something like an index scan will have its own limit to how much it can
> process in a second. If we can only do 100k IOPS while searching the index,
> fetching the heap tuples and processing them, we don't need to support the
> queue depths required for doing 1M IOPS within one backend.
>
> That's something that can presumably be quite easily experimentally
> ballparked:
>
> A fully cached, completely uncorrelated index scan seems to be able to do
> about 1.5M page fetches/sec on my ~6 YO server CPU with turbo boost
> disabled, when never looking at the results (i.e. using OFFSET) or
> immediately filtering away the row. So I'd guess the limit on newer CPUs,
> in SKUs optimized for clock speed and with boost enabled, is north of 2.5M
> pages/sec, higher than I'd have thought! That's without doing any IO,
> though.

We've done good work on nbtree's ability to avoid provably unnecessary work
in recent years; see _bt_set_startikey. What that means is that the majority
of the index scans used to test the patch probably have _bt_readpage calls
that spend most of their time simply collecting all of the TIDs from the
leaf page, without any scan key overhead (barring an initial precheck within
_bt_set_startikey, once per _bt_readpage, to prove that the optimization is
safe). With large posting list tuples, we'll do even less work, since
they're just an array of ItemPointerData.

> With correlated scans the limit is much lower, maybe 150k, just because
> there are so many more tuples per page (and processing them trivially
> becomes the bottleneck).
>
> So, to support actually utilizing the full I/O capability, we need to allow
> for enough batches to keep a few hundred IOs in flight at the very extreme
> end. I'd assume you have a much better idea as to how many batches that
> translates to?

I can give you a range. The problem is that it's a range starting from
"absurdly optimistic" through to "absurdly pessimistic". Neither extreme is
very unlikely (there's a wide natural variation in workloads), and it's hard
to argue usefully about what will be true in most cases. In short, I can
tell you plenty, but nothing that seems particularly useful for determining
how many batches we should cap the ring buffer at.

I don't think there's anything fundamentally objectionable about our
deriving the current maximum of 64 through trial and error.
I assume that INDEX_SCAN_MAX_BATCHES must be constrained to a low-ish power
of two, so that the ring buffer maintenance routines avoid DIV instructions
(a modulo by anything other than a power of two cannot be optimized by the
compiler into a bitwise AND). There just aren't that many integers that even
qualify as candidates!

I'm pretty sure that 32 is too low (though it's hard to tell with buffered
I/O on a fast local SSD). 128 might still be too low in extreme corner cases
involving high latency and few matches per batch (though I doubt it). 256
seems far too high to ever make sense (but I've been wrong before).

--
Peter Geoghegan
