On 19/03/2026 20:02, 段坤仁(刻韧) wrote:
Hi hackers,
This is related to two recent threads: [0] https://www.postgresql.org/message-id/flat/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru
[1] 
https://www.postgresql.org/message-id/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com
I think there may be a bug in the compat logic introduced in
RecordNewMultiXact() to fix [0]. The compat logic handles WAL
generated by older  minor versions where ZERO_OFF_PAGE N+1
may appear after CREATE_ID:N in the WAL stream. However, the condition
used to detect whether the next offset page needs initialization
seems to fail after a crash-restart, causing the standby to enter a
FATAL crash loop.
We hit this in production: a PG 16.12 standby replaying WAL from a
PG 16.8 primary entered a crash-restart loop with:
FATAL:  could not access status of transaction 1277952
DETAIL:  Could not read from file "pg_multixact/offsets/0013" at offset
131072: read too few bytes.
CONTEXT:  WAL redo at 5194/5F5F7118 for MultiXact/CREATE_ID: 1277951
offset 2605275 nmembers 2: 814605500 (keysh) 814605501 (keysh)

:-(. This is a gift that keeps giving.

== Analysis ==

...

== Timeline: the lock-free window ==
...

== Reproduction ==

Thanks for the thorough analysis and the repro!

== Fix ideas ==
I have two ideas for fixing this. Two independent patches are
attached.
Idea 1 (0001-Fix-multixact-compat-logic-via-unconditional-zero.patch): Remove the "latest_page_number == pageno" condition entirely. During
recovery, whenever we cross a page boundary in RecordNewMultiXact(),
unconditionally ensure the next page is initialized via
SimpleLruZeroPage().
Pro: Minimal change, easy to review, impossible to miss any edge
case since we always initialize. Safe because SimpleLruZeroPage is
idempotent -- at this point the next page only contains zeros
(CREATE_ID records for that page haven't been replayed yet), so
re-zeroing loses nothing.

Con: When replaying WAL from a new-version primary (where
ZERO_OFF_PAGE is emitted BEFORE CREATE_ID), the page has already
been initialized by the ZERO_OFF_PAGE record. This patch will
redundantly zero+write it again. The cost is one extra
SimpleLruZeroPage + SimpleLruWritePage per 2048 multixacts during
recovery, which is negligible in practice, but not zero.

This fix is not correct. The CREATE_ID records can appear in the WAL out of order, e.g:

CREATE_ID:N+2 -> CREATE_ID:N+3 -> CREATE_ID:N+1

With the patch, replaying the CREATE_ID:N+1 record would overwrite the changes made by the CREATE_ID:N+2 and CREATE_ID:N+3 records.

Idea 2 (0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch):
Also remove the "latest_page_number == pageno" condition, but add a
two-tier check: first, if latest_page_number == pageno (fast path),
we know the next page needs initialization. Otherwise, scan the SLRU
buffer slots to check if the next page already exists (e.g. from a
prior ZERO_OFF_PAGE replay). Only initialize if the page is not
found in the buffer.
Pro: Avoids the redundant zero+write when replaying new-version WAL
where ZERO_OFF_PAGE has already been replayed before CREATE_ID. The
SLRU buffer scan is O(n_buffers) which is cheap (default 8 buffers).
Con: More complex than Idea 1. The SLRU buffer check is a heuristic
-- if the page was written out and evicted from the buffer pool
before CREATE_ID replay, we'd zero it again (still safe, just
redundant). Also introduces a dependency on SLRU buffer pool
internals.
I'd appreciate it if someone could review whether my analysis of the
bug is correct. If it is, I'd be happy to hear any thoughts on the
patches or better approaches to fix this.

I think this has the same problem, although the extra conditions make it much harder to hit. Maybe even impossible, because the SLRU never evicts the latest page. But it seems fragile to rely on that, and the consequences are pretty bad if you get that wrong.

Idea 3:

I think a better fix is to accept that our tracking is a little imprecise and use SimpleLruDoesPhysicalPageExist() to check if the page exists. I suspect that's too expensive to do on every RecordNewMultiXact() call that crosses a page, but perhaps we could do it once at StartupMultiXact().

Or perhaps track last-zeroed page separately from latest_page_number, and if we haven't seen any XLOG_MULTIXACT_ZERO_OFF_PAGE records yet after startup, call SimpleLruDoesPhysicalPageExist() to determine if initialization is needed. Attached patch does that.

- Heikki
From 70dc0fe41f0aa6afb15358b6362900259e3d08a3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Thu, 19 Mar 2026 21:27:51 +0200
Subject: [PATCH 1/1] Fix multixact backwards-compatibility with CHECKPOINT
 race condition

If a CHECKPOINT record with nextMulti N is written to the WAL before
the CREATE_ID record for N, and N happens to be the first multixid on
an offset page, the backwards compatibility logic to tolerate WAL
generated by older minor versions failed to compensate for the missing
XLOG_MULTIXACT_ZERO_OFF_PAGE record. In that case, the
latest_page_number was initialized at the start of WAL replay to the
page for nextMulti from the CHECKPOINT record, even if we had not seen
the CREATE_ID record for that multixid yet, which fooled the backwards
compatibility logic to think that the page was already initialized.

To fix, track the last XLOG_MULTIXACT_ZERO_OFF_PAGE that we've seen
separately from latest_page_number. If we haven't seen any
XLOG_MULTIXACT_ZERO_OFF_PAGE records yet, use
SimpleLruDoesPhysicalPageExist() to check if the page needs to be
initialized.
---
 src/backend/access/transam/multixact.c | 65 +++++++++++++++++++-------
 1 file changed, 47 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index f9bd1dd19e6..5dcb7f9068f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -417,7 +417,17 @@ static MemoryContext MXactContext = NULL;
 #define debug_elog6(a,b,c,d,e,f)
 #endif
 
-/* hack to deal with WAL generated with older minor versions */
+/*
+ * Hack to deal with WAL generated with older minor versions.
+ *
+ * last_initialized_offsets_page is the XLOG_MULTIXACT_ZERO_OFF_PAGE record
+ * that we saw during WAL replay, or -1 if we haven't seen any yet.
+ *
+ * pre_initialized_offsets_page is the last page that was implicitly
+ * initialized by replaying a XLOG_MULTIXACT_CREATE_ID record, when we had not
+ * seen a XLOG_MULTIXACT_ZERO_OFF_PAGE record for the page yet.
+ */
+static int64 last_initialized_offsets_page = -1;
 static int64 pre_initialized_offsets_page = -1;
 
 /* internal MultiXactId management */
@@ -982,29 +992,46 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	 * such a version, the next page might not be initialized yet.  Initialize
 	 * it now.
 	 */
-	if (InRecovery &&
-		next_pageno != pageno &&
-		pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
+	if (InRecovery && next_pageno != pageno)
 	{
-		elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+		bool		init_needed;
 
-		lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
-		LWLockAcquire(lock, LW_EXCLUSIVE);
+		/*
+		 * Check if the page exists. SimpleLruDoesPhysicalPageExist() is
+		 * somewhat expensive, however, so if we have seen any
+		 * XLOG_MULTIXACT_ZERO_OFF_PAGE records during replay, we trust
+		 * without checking that the last page initialized by
+		 * XLOG_MULTIXACT_ZERO_OFF_PAGE already exists, and the one after that
+		 * does not.
+		 */
+		if (last_initialized_offsets_page == -1)
+			init_needed = !SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, next_pageno);
+		else
+			init_needed = (last_initialized_offsets_page == pageno);
 
-		/* Create and zero the page */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+		if (init_needed)
+		{
+			elog(DEBUG1, "next offsets page is not initialized, initializing it now");
 
-		/* Make sure it's written out */
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+			lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
 
-		LWLockRelease(lock);
+			/* Create and zero the page */
+			slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
 
-		/*
-		 * Remember that we initialized the page, so that we don't zero it
-		 * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
-		 */
-		pre_initialized_offsets_page = next_pageno;
+			/* Make sure it's written out */
+			SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+			Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+			LWLockRelease(lock);
+
+			/*
+			 * Remember that we initialized the page, so that we don't zero it
+			 * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+			 */
+			pre_initialized_offsets_page = next_pageno;
+			last_initialized_offsets_page = next_pageno;
+		}
 	}
 
 	/*
@@ -3560,6 +3587,8 @@ multixact_redo(XLogReaderState *record)
 			Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
 			LWLockRelease(lock);
+
+			last_initialized_offsets_page = pageno;
 		}
 		else
 			elog(DEBUG1, "skipping initialization of offsets page " INT64_FORMAT " because it was already initialized on multixid creation", pageno);
-- 
2.47.3

Reply via email to