Unable to start replica after failover
Hi,

We're using asynchronous streaming replication together with WAL archiving and recovery (we've set these parameters: archive_command, restore_command, recovery_target_timeline=latest). To manage backup and recovery, we're using pg_probackup. We have a 3-node cluster managed by Patroni on Raft.

While running our cluster, we encounter a recurring problem: after a failover, the former leader can recover neither via the tools provided by PostgreSQL nor with the Patroni tools (pg_rewind). We've seen a number of different WAL reading errors. For example:

* "could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50"; or
* "could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 0005054E00FB, offset 3448832".

There have been some other variations, but all such errors have the same cause:

1. At startup, the former leader does not immediately get access to the backup. As a result, the recovery process uses the local WAL, which inevitably leads to a split brain, since in most cases there is a lag in replication after a failover.
2. Next, the former leader finally manages to read the latest log from the backup, and that log has a higher priority than the local log, even though recovery with the local log has already occurred. As a result, we see the following: "new timeline 3 forked off current database system timeline 2 before current recovery point 54E/FB348150".

When pg_rewind is run, it also uses the log from the backup (the lagging log from the new leader) instead of the partial log with which the former leader has already been started. Then pg_rewind returns errors while reading the log from the backup backward, looking for the last checkpoint, which is quite reasonable because, once a new leader starts, the point of divergence normally ends up in the next timeline, and the previous timeline's backup log does not have a block with the LSN of the divergence. In this case it is also impossible to build a map of the pages that have been changed in order to roll back the changes using pg_rewind.

To avoid the problem, we decided to stop using restore_command. Could you please let us know if there is a better solution to the problem we've described?
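For reference, a minimal sketch of the kind of configuration described above (the backup catalog path /backup, the instance name node, and the remote access options are placeholders, not our actual values):

---
# postgresql.conf (managed by Patroni) - illustrative sketch only
archive_mode = on
archive_command = 'pg_probackup archive-push -B /backup --instance node --wal-file-name=%f'
restore_command = 'pg_probackup archive-get -B /backup --instance node --wal-file-path=%p --wal-file-name=%f'
recovery_target_timeline = 'latest'
---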
RE: Unable to start replica after failover
What additional information is needed?
RE: Unable to start replica after failover
We know what the problem is, but we don't know how to solve it correctly.

After the failover, the *new leader* promoted and read its local partial log up to LSN 54E/FB348118:

---
2022-05-23 01:47:52.124 [12088] LOG: record with incorrect prev-link 0/100 at 54E/FB348118
2022-05-23 01:47:52.124 [12088] LOG: redo done at 54E/FB3480F0 system usage: CPU: user: 321.21 s, system: 144.77 s, elapsed: 354977.44 s
rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 54E/FB348118, prev 54E/FB3480F0, desc: END_OF_RECOVERY tli 3; prev tli 2; time 2022-05-23 01:47:53.342100
2022-05-23 01:47:52.900 [12088] LOG: selected new timeline ID: 3
2022-05-23 01:47:53.658 [12084] LOG: database system is ready to accept connections
---

The *previous leader* started after the failure and read its local partial log up to LSN 54E/FB348150:

---
2022-05-23 01:50:12.945 [122891] LOG: redo starts at 54E/F97F4878
2022-05-23 01:50:13.010 [122891] LOG: invalid record length at 54E/FB348150: wanted 24, got 0
2022-05-23 01:50:13.011 [122891] LOG: consistent recovery state reached at 54E/FB348150
2022-05-23 01:50:13.012 [122883] LOG: database system is ready to accept read-only connections
---

Then our restore_command on the *previous leader* restores a partial file from the archive (pushed there by the *new leader*):

---
2022-05-23 01:50:14 [123730]: [1-1]: INFO: pg_probackup archive-get WAL file: 0002054E00FB, remote: ssh, threads: 1/1, batch: 1
---

This file has a higher priority than the original local partial file on the *previous leader*, and that leads to the problem:

---
2022-05-23 01:50:14.448 [122891] LOG: new timeline 3 forked off current database system timeline 2 before current recovery point 54E/FB348150
---

pg_rewind then tries to use this file (from the *new leader*) to build a map of pages that have changed since the last checkpoint:

---
2022-05-23 01:51:32,202 INFO: Lock owner: pg51; I am pg01
2022-05-23 01:51:32,209 INFO: Local timeline=2 lsn=54E/FB348150
2022-05-23 01:51:32,252 INFO: master_timeline=3
2022-05-23 01:51:32,254 INFO: master: history=1 53E/FD65D298 no recovery target specified 2 54E/FB348118 no recovery target specified
2022-05-23 01:51:32,499 INFO: running pg_rewind from pg51
2022-05-23 01:51:32,574 INFO: running pg_rewind from dbname=postgres user=postgres host= IP port=5432 target_session_attrs=read-write
2022-05-23 01:51:32,640 INFO: pg_rewind exit code=1
2022-05-23 01:51:32,640 INFO: stdout=
2022-05-23 01:51:32,640 INFO: stderr=pg_rewind: servers diverged at WAL location 54E/FB348118 on timeline 2
pg_rewind: fatal: could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 0002054E00FB, offset 3448832
---

We checked this log (from the *new leader*). When reading it forward, the output is:

---
record with incorrect prev-link 0/100 at 54E/FB348118
---

and when reading it backward:

---
54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 0002054E00FB, offset 3448832
---

For now we have disabled restore_command, but I don't think that is the best solution. What is the correct solution in this case?
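For reference, the forward read above can be reproduced against the restored segment with pg_waldump, roughly like this (the path is a placeholder; the timeline and LSN are taken from the logs above):

---
# read one record forward from the divergence point on timeline 2
pg_waldump -p /path/to/restored/wal -t 2 -s 54E/FB348118 -n 1
---

The backward read is what pg_rewind itself does when it searches backward from the divergence point for the last common checkpoint, which is where the "unexpected pageaddr" error comes from.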
RE: Unable to start replica after failover
Hi,

Yes, the *partial* file from the *new leader* is restored to the *last leader* and renamed to 0002054E00FB, without the *partial* suffix.

>> Postgres asks for file 0002054E00FB but somehow gets
>> 0002054E00FB.partial instead. Why?

Yes, Postgres asks for 0002054E00FB and gets the renamed 0002054E00FB.partial (without the *partial* suffix). So, in the end, the *last leader* has a local 0002054E00FB.partial, with the records needed for recovery, and 0002054E00FB (the renamed 0002054E00FB.partial from the new leader). But 0002054E00FB has a higher priority than 0002054E00FB.partial.

Also, the *new leader* archives a full 0003054E00FB for the new timeline, and this file contains the same data as 0002054E00FB.
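Just to illustrate the state described above, the last leader's pg_wal ends up with both files side by side (illustrative listing only; other entries omitted):

---
$ ls pg_wal/
...
0002054E00FB             <- restored from the archive (the new leader's renamed .partial)
0002054E00FB.partial     <- the last leader's own local partial segment
...
---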
RE: Unable to start replica after failover
I can't answer your question. How do you think recovery from the archive should work in this case?

1. The partial file should not be restored at all in such cases;
2. The partial file should be restored as a partial and replace the local partial;
3. The restore command should return a conflict: the file already exists.
RE: Unable to start replica after failover
As far as I understand from the logs, the last leader does not yet know about the new timeline, so it tries to download the full log from the previous timeline. It seems there should be a conflict in this case, because the partial file already exists locally when restoring, but this does not happen, and the partial file from the new leader is loaded as a full log. Do you think it is possible to add a check to the restore command that a partial or full file already exists? Or is disabling the restore command a possible solution in this case?
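Something along these lines is what I have in mind for such a check; this is only a rough sketch (the script path, the backup catalog /backup and the instance name node are made up), not something we have tested:

---
#!/bin/sh
# restore_wal.sh - hypothetical restore_command wrapper:
# skip the archive fetch if the segment (or its .partial) is already present locally.
# Usage in postgresql.conf: restore_command = '/usr/local/bin/restore_wal.sh %f %p'

WAL_FILE="$1"                        # %f - requested segment name
WAL_PATH="$2"                        # %p - destination path, relative to the data directory
PGWAL_DIR="$(dirname "$WAL_PATH")"   # assumption: %p points into pg_wal

# If a local full or partial copy of this segment already exists,
# refuse to fetch it from the archive instead of overwriting it.
if [ -f "$PGWAL_DIR/$WAL_FILE" ] || [ -f "$PGWAL_DIR/$WAL_FILE.partial" ]; then
    echo "restore_wal.sh: $WAL_FILE already present locally, skipping archive-get" >&2
    exit 1
fi

exec pg_probackup archive-get -B /backup --instance node \
    --wal-file-path="$WAL_PATH" --wal-file-name="$WAL_FILE"
---

A nonzero exit status only tells the server that the file is not available from the archive, so it would fall back to the local WAL or to streaming; whether that is enough to avoid the timeline mix-up described above is exactly my question.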
Streaming WAL from primary terminating
Hi,

After a failover, none of the standby nodes could start streaming WAL recovery. Streaming starts from 1473/A500, but the standby starts at 1473/A5FFEE08, and this seems to be the problem. What can we do in this case to recover? Is it possible to shift the WAL streaming recovery point on the primary? Can a checkpoint on the primary help in this situation?

2022-09-26 14:08:23.289 [3747796] LOG: consistent recovery state reached at 1473/A5FFEE08
2022-09-26 14:08:23.289 [3747796] LOG: invalid record length at 1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:08:23.290 [3747793] LOG: database system is ready to accept read-only connections
2022-09-26 14:08:23.292 [3747867] LOG: starting bgworker pgsentinel
2022-09-26 14:08:23.672 [3747868] LOG: started streaming WAL from primary at 1473/A500 on timeline 18
2022-09-26 14:08:24.363 [3747796] LOG: invalid record length at 1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:08:24.366 [3747868] FATAL: terminating walreceiver process due to administrator command
2022-09-26 14:08:24.366 [3747796] LOG: invalid record length at 1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:08:24.366 [3747796] LOG: invalid record length at 1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:08:24.366 [3747796] LOG: invalid record length at 1473/A5FFEE08: wanted 24, got 0
...
2022-09-26 14:12:24.055 [3747796] LOG: invalid record length at 1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:12:29.060 [3747796] LOG: invalid record length at 1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:12:34.064 [3747796] LOG: invalid record length at 1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:12:39.068 [3747796] LOG: invalid record length at 1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:12:39.910 [3747793] LOG: received fast shutdown request
2022-09-26 14:12:39.911 [3747793] LOG: aborting any active transactions
2022-09-26 14:12:39.911 [3747867] LOG: bgworker pgsentinel signal: processed SIGTERM
2022-09-26 14:12:39.914 [3747797] LOG: shutting down
2022-09-26 14:12:39.922 [3747793] LOG: database system is shut down
RE: Streaming WAL from primary terminating
primary_conninfo is set in postgresql.conf. The reason is definitely something else, but we do not know what. This is fatal in this case.

> 2022-09-26 14:08:24.366 [3747796] LOG: invalid record length at
> 1473/A5FFEE08: wanted 24, got 0
> 2022-09-26 14:08:24.366 [3747796] LOG: invalid record length at
> 1473/A5FFEE08: wanted 24, got 0

LSN 1473/A5FFEE08 should not need to be read at all, according to the data from the following log record:

rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 1473/A628, prev 1473/A5FFD070, desc: OVERWRITE_CONTRECORD lsn 1473/A5FFEE08; time 2022-09-26 11:36:58.104667 MSK
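Output in that format comes from pg_waldump; the record above can be found with something along these lines (the pg_wal path is a placeholder; the timeline and LSN are taken from the earlier logs):

---
# dump a few records starting from the LSN the standby is stuck on,
# on the primary's current timeline
pg_waldump -p /path/to/pg_wal -t 18 -s 1473/A5FFEE08 -n 5
---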
RE: Streaming WAL from primary terminating
>> Which version of PostgreSQL are you using? There have been a few commits
>> around continuation records lately, so it is hard to say if you are pointing
>> at a new issue or if this is a past one already addressed.

Hi Michael,

We are using PostgreSQL 14.4.
RE: Streaming WAL from primary terminating
Do you need any additional information?