Hello! My colleague found this issue while running tests with pg_rewind: in some situations, using pg_rewind can result in incomplete recovery on the target, while the server appears to start normally. Errors appear only later, when users actually try to query the database.
I think it would be better to handle this properly and report an error during recovery, instead of completing startup and reporting errors only when users try to access the data. This also seems to be the intention of these conditions, based on the comment a few lines above my changes in the attached patch.

Originally we noticed this with only the WAL summarizer enabled; no other settings were changed. Later we realized the issue is reproducible without it, so it could affect earlier Postgres versions. The WAL summarizer simply delayed recycling on the target, which prevented pg_rewind from exiting early with an error.

It is also reproducible by changing the WAL size settings so that the source recycles records while the target keeps them. This requires an asymmetric configuration, which is not ideal, but since `summarize_wal` is only available in pg17+, I based my test case on this approach instead. All we need is a situation where some WAL segments are missing on the standby, which seems to be a possibility in the "backup-from-replica" scenario described above?

The fix also seems simple: relax the conditions used for the "WAL ends before consistent recovery point" error so that it catches this case. What do you think?
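To illustrate the idea, here is a minimal, self-contained C sketch of the check before and after relaxing it. It is a simplified model, not the PostgreSQL source: the `RecoveryState` struct, its fields and both functions are hypothetical names invented for illustration, and it assumes the existing check only fires when archive recovery was requested, as the patch file name suggests.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for the recovery state.
 * These are NOT PostgreSQL's real structures or variable names.
 */
typedef struct
{
    bool     archive_recovery_requested; /* was archive recovery requested? */
    uint64_t min_recovery_point;         /* LSN needed for a consistent state */
    uint64_t end_of_wal;                 /* last LSN that replay could reach */
} RecoveryState;

/* Replay is consistent only if it got at least as far as min_recovery_point. */
static bool
reached_consistency(const RecoveryState *s)
{
    return s->end_of_wal >= s->min_recovery_point;
}

/*
 * Assumed current behaviour: the consistency check is applied only when
 * archive recovery was requested, so other cases start up regardless.
 */
static bool
startup_allowed_gated(const RecoveryState *s)
{
    if (s->archive_recovery_requested && !reached_consistency(s))
        return false;   /* "WAL ends before consistent recovery point" */
    return true;
}

/*
 * Relaxed condition: the consistency check is enforced in every case,
 * so running out of WAL early is reported during recovery.
 */
static bool
startup_allowed_relaxed(const RecoveryState *s)
{
    return reached_consistency(s);
}
```

For example, with `archive_recovery_requested = false`, `min_recovery_point = 200` and `end_of_wal = 100`, the gated check lets startup continue while the relaxed check refuses it, which is the behaviour change proposed here.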
0001-Enforce-minRecoveryPoint-check-regardless-of-archive.patch
