Hi hackers,

This is related to two recent threads:

[0] https://www.postgresql.org/message-id/flat/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru
[1] https://www.postgresql.org/message-id/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com

I think there may be a bug in the compat logic introduced in
RecordNewMultiXact() to fix [0]. The compat logic handles WAL generated
by older minor versions, where the ZERO_OFF_PAGE record for the next
offsets page may appear after the CREATE_ID record that needs it in the
WAL stream. However, the condition used to detect whether the next
offsets page needs initialization seems to fail after a crash-restart,
causing the standby to enter a FATAL crash loop.

We hit this in production: a PG 16.12 standby replaying WAL from a
PG 16.8 primary entered a crash-restart loop with:

> FATAL:  could not access status of transaction 1277952
> DETAIL:  Could not read from file "pg_multixact/offsets/0013" at
> offset 131072: read too few bytes.
> CONTEXT:  WAL redo at 5194/5F5F7118 for MultiXact/CREATE_ID: 1277951
> offset 2605275 nmembers 2: 814605500 (keysh) 814605501 (keysh)

== Analysis ==

The compat logic condition is:

> if (InRecovery &&
>     next_pageno != pageno &&
>     MultiXactOffsetCtl->shared->latest_page_number == pageno)

The third condition (latest_page_number == pageno) assumes that if
latest_page_number equals the current multixact's page, the next page
hasn't been initialized yet. This holds during normal (non-crash)
replay because latest_page_number is incrementally maintained by the
SimpleLruZeroPage() calls that initialize each new page.
But after a crash-restart, StartupMultiXact() resets latest_page_number
to page(checkPoint.nextMulti):

> void
> StartupMultiXact(void)
> {
>     MultiXactId multi = MultiXactState->nextMXact;
>     ...
>     pageno = MultiXactIdToOffsetPage(multi);
>     MultiXactOffsetCtl->shared->latest_page_number = pageno;
>     ...
> }

When the checkpoint captured an already-advanced nextMXact (which
happens when GetNewMultiXactId() incremented nextMXact before the
backend wrote CREATE_ID to WAL), checkPoint.nextMulti >= N+1, so
latest_page_number = page(N+1) = P+1. The compat check becomes:

> P+1 == P  ->  FALSE  ->  compat logic skipped

The subsequent SimpleLruReadPage(next_pageno = P+1) then fails because
page P+1 doesn't exist on disk: ZERO_OFF_PAGE:P+1 hasn't been replayed
yet -- it comes after CREATE_ID:N in the WAL, which is exactly the
reordering the compat logic is supposed to handle.

The fix for [1] addressed a related issue, where TRUNCATE replay reset
latest_page_number to a very old value (latest_page_number << pageno).
That fix covers the "too small" direction, but not the "too large"
direction described here.

== Timeline: the lock-free window ==

The root cause is a lock-free window in MultiXactIdCreateFromMembers().
GetNewMultiXactId() advances nextMXact under MultiXactGenLock, but the
lock is released BEFORE XLogInsert(CREATE_ID). Any checkpoint that runs
in this window captures the advanced nextMXact without the
corresponding CREATE_ID in WAL.

Assume multi N is the last entry on offset page P (entry 2047), so
multi N+1 falls on page P+1.

1. [Backend] GetNewMultiXactId():
   - LWLockAcquire(MultiXactGenLock)
   - result = nextMXact (= N)
   - nextMXact++ (now N+1)
   - LWLockRelease(MultiXactGenLock)
   - return N

   <<< lock-free window opens here >>>
   nextMXact = N+1 is globally visible, but CREATE_ID:N is not in WAL yet.

2. [Checkpointer] CHECKPOINT runs during the window:
   - reads nextMXact = N+1
   - writes checkpoint record with nextMulti=N+1

3. [Standby] replays the checkpoint, does a restartpoint (persists
   nextMulti=N+1 to disk)

4. [Standby] *** CRASH ***

   <<< lock-free window closes >>>

5. [Backend] XLogInsert(CREATE_ID:N) -- this WAL record lands AFTER the
   checkpoint record

6. [Backend] RecordNewMultiXact(N):
   - next_pageno = page(N+1) = P+1
   - SimpleLruReadPage(P+1) -- works fine on the primary

The resulting WAL stream on disk:

> ... CHECKPOINT(nextMulti=N+1) ... CREATE_ID:N ... ZERO_OFF_PAGE:P+1 ...
>     ^                              ^               ^
>     redo starts here               needs page P+1  page P+1 init
>     after crash-restart

After the standby crash-restart:

7. [Standby] StartupMultiXact():
   - multi = nextMXact = N+1 (from the checkpoint)
   - latest_page_number = page(N+1) = P+1

8. [Standby] Redo of CREATE_ID:N -> RecordNewMultiXact(N):
   - pageno = page(N) = P, next_pageno = page(N+1) = P+1
   - Compat logic check:
       InRecovery = true
       next_pageno (P+1) != pageno (P)
       latest_page_number (P+1) == pageno (P)  -- FAIL
     --> compat logic SKIPPED
   - SimpleLruReadPage(P+1): page P+1 is not on disk yet
     (ZERO_OFF_PAGE:P+1 not replayed)
     --> FATAL: read too few bytes

== Reproduction ==

I wrote a deterministic reproducer using sleep injection. A sleep
injection patch (sleep-injection-for-multixact-compat-logic-bug-repro.patch)
adds a 60-second sleep in MultiXactIdCreateFromMembers(), after
GetNewMultiXactId() returns but before XLogInsert(CREATE_ID). The sleep
only triggers when the allocated multi is the last entry on an offset
page (entry 2047) AND the file /tmp/mxact_sleep_enabled exists (so the
batch phase that generates multixacts is not affected).

The setup:

- Primary: PG 16.11 with the sleep injection patch applied.
- Standby: PG 16.13 (contains the compat logic from [0] and [1]).

The test script (test_multixact_compat.sh) does the following:

1. Initialize the primary, create a test table
2. Generate multixacts until nextMulti's entry is exactly 2047 (the
   last entry on page P)
3. pg_basebackup to create the standby, start it with pg_new (16.13),
   checkpoint_timeout=30s
4. Session A: BEGIN; SELECT * FROM t WHERE id=1 FOR SHARE;
   (dirties the heap buffer)
5. CHECKPOINT to flush the dirty buffer (*)
6. Enable sleep injection (touch /tmp/mxact_sleep_enabled)
7. Session B: SELECT * FROM t WHERE id=1 FOR SHARE;
   -> creates multixact N, GetNewMultiXactId() advances nextMXact to
   N+1, then sleeps 60s
8. During the sleep: CHECKPOINT on the primary -> completes instantly
   (the buffer is clean) -> captures nextMulti=N+1 in the checkpoint
   record
9. Wait for the standby to replay the checkpoint and do a restartpoint
10. Crash the standby (pg_ctl -m immediate stop)
11. Wait for the sleep to end -> CREATE_ID:N is written to WAL
12. Restart the standby -> FATAL

(*) The CHECKPOINT in step 5 is essential. Without it, the checkpoint
in step 8 blocks, because heap_lock_tuple() holds the buffer content
lock in EXCLUSIVE mode across the entire MultiXactIdCreateFromMembers()
call (including the sleep). Session A's FOR SHARE dirtied the heap
buffer, so BufferSync() needs LW_SHARED on the content lock and blocks.
The intermediate CHECKPOINT flushes the dirty buffer before Session B
acquires the content lock.

Result:

> DEBUG:  next MultiXactId: 114688; next MultiXactOffset: 688233
> FATAL:  could not access status of transaction 114688
> DETAIL:  Could not read from file "pg_multixact/offsets/0001"
> at offset 196608: read too few bytes.
> CONTEXT:  WAL redo at 0/40426E0 for MultiXact/CREATE_ID: 114687
> offset 688231 nmembers 2: 116735 (sh) 116736 (sh)

== Fix ideas ==

I have two ideas for fixing this; two independent patches are attached.

Idea 1 (0001-Fix-multixact-compat-logic-via-unconditional-zero.patch):
Remove the "latest_page_number == pageno" condition entirely. During
recovery, whenever we cross a page boundary in RecordNewMultiXact(),
unconditionally ensure the next page is initialized via
SimpleLruZeroPage().

Pro: Minimal change, easy to review, and impossible to miss an edge
case since we always initialize.
It is also safe because SimpleLruZeroPage() is idempotent: at this
point the next page contains only zeros (the CREATE_ID records for that
page haven't been replayed yet), so re-zeroing loses nothing.

Con: When replaying WAL from a new-version primary (where ZERO_OFF_PAGE
is emitted BEFORE CREATE_ID), the page has already been initialized by
the ZERO_OFF_PAGE record, and this patch will redundantly zero and
write it again. The cost is one extra SimpleLruZeroPage() +
SimpleLruWritePage() per 2048 multixacts during recovery, which is
negligible in practice, but not zero.

Idea 2 (0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch):
Also remove the "latest_page_number == pageno" condition, but add a
two-tier check: first, if latest_page_number == pageno (fast path), we
know the next page needs initialization. Otherwise, scan the SLRU
buffer slots to check whether the next page already exists (e.g. from a
prior ZERO_OFF_PAGE replay). Only initialize if the page is not found
in the buffer.

Pro: Avoids the redundant zero-and-write when replaying new-version WAL
where ZERO_OFF_PAGE has already been replayed before CREATE_ID. The
SLRU buffer scan is O(n_buffers), which is cheap (8 buffers by
default).

Con: More complex than Idea 1. The SLRU buffer check is a heuristic: if
the page was written out and evicted from the buffer pool before
CREATE_ID replay, we'd zero it again (still safe, just redundant). It
also introduces a dependency on SLRU buffer pool internals.

I'd appreciate it if someone could review whether my analysis of the
bug is correct. If it is, I'd be happy to hear any thoughts on the
patches or better approaches to fix this.

Regards,
Duan
sleep-injection-for-multixact-compat-logic-bug-repro.patch
Description: Binary data
test_multixact_compat.sh
Description: Binary data
0001-Fix-multixact-compat-logic-via-unconditional-zero.patch
Description: Binary data
0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch
Description: Binary data
