Hi Hackers,

While reviewing the patch in the thread [1] I noticed the following:
When the WAL prefetcher encounters a block reference that carries a full-page image (FPW) or has BKPBLOCK_WILL_INIT set, it correctly skips issuing a prefetch for that block: the old on-disk content is irrelevant, since replay will overwrite or zero the page entirely. However, if a later WAL record within the look-ahead window references the same block without an FPW, the prefetcher would still issue an fadvise64 syscall for it, because the block was never recorded in the duplicate-detection window.

Fixed this by marking these blocks as recently seen in the FPW and WILL_INIT skip paths. The existing duplicate-check loop then naturally suppresses prefetch attempts for subsequent references to the same block, counting them under the skip_rep stat. This is particularly effective for workloads that produce many sequential writes to the same page (e.g., bulk inserts into heap-only tables), where each page's first post-checkpoint touch generates an FPW and subsequent inserts to the same page follow shortly after in WAL.

To further reduce wasted prefetch calls, we could increase the window size by adjusting XLOGPREFETCHER_SEQ_WINDOW_SIZE according to the maximum number of blocks that can be prefetched, or maintain a hash table. I did not attempt that in this patch because it could impact redo performance (more CPU cycles). In the worst case, the current fix may fail to help in scenarios where the table has more than four indexes, for example. However, I still believe it is an improvement over the baseline. If we decide to spend more cycles on optimizing the window size, that can be a separate patch.
Benchmarked recovery with 10 GB of WAL from an insert-only workload into a no-index table, replayed from an identical crash snapshot:

Fast disk (NVMe)
  Baseline: redo 37.30s, system CPU 9.38s, 1,204,992 fadvise calls
  Patched:  redo 25.78s, system CPU 3.39s,   122,753 fadvise calls
  Nearly 31% faster redo, 90% fewer fadvise syscalls.

*Prefetch Counters*

  Counter                     Baseline     Patched      Delta
  prefetch (fadvise issued)   1,204,992    122,753      -89.8%
  hit                         924,457      911,785      -1.4%
  skip_init                   1,097,536    1,097,536    0
  skip_fpw                    28           28           0
  skip_rep                    80,020,209   81,115,120   +1,094,911

Slower disk (~2ms latency)
  Baseline: redo 188.04s, system CPU 6.87s, 1,204,992 fadvise calls
  Patched:  redo 60.02s, system CPU 3.39s,   122,753 fadvise calls
  Nearly 68% faster redo, a 3.1x overall speedup.

*Configuration:*

  shared_buffers = '124GB'
  huge_pages = on
  wal_buffers = '512MB'
  max_wal_size = '100GB'
  checkpoint_timeout = '30min'
  full_page_writes = on
  maintenance_io_concurrency = 50
  recovery_prefetch = on

*Workload:*

  CREATE TABLE test_noindex(id bigint, val1 int, val2 int, payload text);
  -- No indexes, no primary key.
  -- Then insert in batches of 1M rows until WAL reaches 10 GB:
  INSERT INTO test_noindex
  SELECT g, (g*7+13)%100000, (g*31+17)%100000,
         repeat(chr(65+(g%26)),60)
  FROM generate_series(1, 1000000) g;

Thanks,
Satya

[1] https://www.postgresql.org/message-id/flat/CA%2B3i_M8C%2BrK9vhwBm8U%2Bys2hbDifoBb4Xnws5Wmn2f4u7iqOpA%40mail.gmail.com#8eac90e696baf6e4f58f91482af28e07
0001-xlogprefetcher-record-recent-fpw.patch
