Unable to start replica after failover

2022-08-01 Thread Lahnov, Igor
Hi,
We're using asynchronous streaming replication together with WAL archiving and 
recovery (we've set these parameters: archive_command, restore_command, 
recovery_target_timeline=latest).
To manage backup and recovery, we're using pg_probackup.
We have a 3-node cluster managed by Patroni on Raft.
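For reference, a minimal sketch of the relevant settings (the backup path and 
instance name are placeholders, not our exact values):
---
# postgresql.conf fragment managed by Patroni; /backup and 'node' are placeholders
archive_mode = on
archive_command = 'pg_probackup archive-push -B /backup --instance node --wal-file-name=%f'
restore_command = 'pg_probackup archive-get -B /backup --instance node --wal-file-path=%p --wal-file-name=%f'
recovery_target_timeline = 'latest'
---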
While running our cluster, we encounter a recurring problem: after a failover, 
the former leader can recover neither via the tools provided by PostgreSQL 
nor via the Patroni tooling (pg_rewind).
We've seen a number of different WAL reading errors. For example:
* "could not find previous WAL record at E6F/C2436F50: invalid resource 
manager ID 139 at E6F/C2436F50"; or
* "could not find previous WAL record at 54E/FB348118: unexpected 
pageaddr 54E/7B34A000 in log segment 0005054E00FB, offset 3448832".
There have been some other variations, but all such errors have the same cause:
1.   At startup, the former leader does not immediately get access to the 
backup. As a result, the recovery process uses the local WAL, which inevitably 
leads to a split brain since in most cases there is a lag in replication after 
a failover.
2.   Next, the former leader finally manages to read the latest log from 
the backup, and that log has a higher priority than the local log, even though 
recovery from the local log has already occurred.
As a result, we see the following:
"new timeline 3 forked off current database system timeline 2 before current 
recovery point 54E/FB348150".
When pg_rewind is run, it also uses the log from the backup (the lagging log 
from the new leader) instead of the partial log with which the former leader 
has already been started.
Next, pg_rewind returns errors while reading the log from the backup backwards, 
looking for the last checkpoint. This is quite reasonable because, once a new 
leader starts, the point of divergence normally ends up in the next timeline, 
and the previous timeline's log in the backup does not contain a block with the 
LSN of the divergence.
Also, in this case, it is impossible to build the map of changed pages that 
pg_rewind needs to roll back the changes.
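For reference, the divergence check that Patroni triggers can be reproduced by 
hand with a dry run against the new leader (the data directory and host below 
are placeholders):
---
# reports the divergence point without modifying the target data directory
pg_rewind --target-pgdata=/var/lib/pgsql/14/data \
          --source-server='host=<new-leader> port=5432 user=postgres dbname=postgres' \
          --dry-run --progress
---
In our case it fails with the same "could not find previous WAL record" error.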

To avoid the problem, we decided to stop using restore_command. Could you 
please let us know if there is a better solution to the problem we've described?



RE: Unable to start replica after failover

2022-08-15 Thread Lahnov, Igor
What additional information is needed?



RE: Unable to start replica after failover

2022-08-23 Thread Lahnov, Igor
We know what the problem is, but don't know how to solve it correctly.

After the failover, the *new leader* was promoted and read its local partial 
log up to LSN 54E/FB348118:

->
2022-05-23 01:47:52.124 [12088] LOG: record with incorrect prev-link 0/100 at 
54E/FB348118
2022-05-23 01:47:52.124 [12088] LOG: redo done at 54E/FB3480F0 system usage: 
CPU: user: 321.21 s, system: 144.77 s, elapsed: 354977.44 s
rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 54E/FB348118, prev 54E/FB3480F0, 
desc: END_OF_RECOVERY tli 3; prev tli 2; time 2022-05-23 01:47:53.342100 
2022-05-23 01:47:52.900 [12088] LOG: selected new timeline ID: 3
2022-05-23 01:47:53.658 [12084] LOG: database system is ready to accept 
connections
<-

The *previous leader* started after the failure and read its local partial log 
up to LSN 54E/FB348150:

->
2022-05-23 01:50:12.945 [122891] LOG: redo starts at 54E/F97F4878
2022-05-23 01:50:13.010 [122891] LOG: invalid record length at 54E/FB348150: 
wanted 24, got 0
2022-05-23 01:50:13.011 [122891] LOG: consistent recovery state reached at 
54E/FB348150
2022-05-23 01:50:13.012 [122883] LOG: database system is ready to accept 
read-only connections
<-

Our restore_command on the *previous leader* then restores a partial file from 
the archive (archived by the *new leader*):

->
2022-05-23 01:50:14 [123730]: [1-1]: INFO: pg_probackup archive-get WAL file: 
0002054E00FB, remote: ssh, threads: 1/1, batch: 1
<-

This file has a higher priority than the original local partial file on the 
*previous leader*, which leads to the problem:

---
2022-05-23 01:50:14.448 [122891] LOG: new timeline 3 forked off current 
database system timeline 2 before current recovery point 54E/FB348150
---

pg_rewind then tries to use this file (from the *new leader*) to build a map of 
the pages that have changed since the last checkpoint:

---
2022-05-23 01:51:32,202 INFO: Lock owner: pg51; I am pg01
2022-05-23 01:51:32,209 INFO: Local timeline=2 lsn=54E/FB348150
2022-05-23 01:51:32,252 INFO: master_timeline=3
2022-05-23 01:51:32,254 INFO: master: history=1 53E/FD65D298 no recovery target 
specified
2 54E/FB348118 no recovery target specified
2022-05-23 01:51:32,499 INFO: running pg_rewind from pg51
2022-05-23 01:51:32,574 INFO: running pg_rewind from dbname=postgres 
user=postgres host= IP port=5432 target_session_attrs=read-write
2022-05-23 01:51:32,640 INFO: pg_rewind exit code=1
2022-05-23 01:51:32,640 INFO: stdout=
2022-05-23 01:51:32,640 INFO: stderr=pg_rewind: servers diverged at WAL 
location 54E/FB348118 on timeline 2
pg_rewind: fatal: could not find previous WAL record at 54E/FB348118: 
unexpected pageaddr 54E/7B34A000 in log segment 0002054E00FB, 
offset 3448832
---

We checked this log (from the *new leader*). When reading it forward, the 
output is like this:
---
record with incorrect prev-link 0/100 at 54E/FB348118
---

and when reading it backward:
---
54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 
0002054E00FB, offset 3448832
---
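(For reference, the forward check can be done with pg_waldump over a copy of 
the restored segment; the path below is a placeholder. The backward read is 
what pg_rewind's internal WAL reader performs while searching for the last 
common checkpoint, so it cannot be reproduced with pg_waldump directly.)
---
# hypothetical path to a copy of the restored timeline-2 segment;
# pg_waldump reads forward and stops at the first broken record
pg_waldump /path/to/restored-segment
---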

For now we have disabled restore_command, but I don't think that is the best 
solution. What is the correct solution in this case?




RE: Unable to start replica after failover

2022-08-24 Thread Lahnov, Igor
Hi,
Yes, the *partial* file from the *new leader* is restored to the *last leader* 
and renamed to 0002054E00FB, without the *.partial* suffix.

>>Postgres asks for file 0002054E00FB but somehow gets
>>0002054E00FB.partial instead. Why?

Yes, Postgres asks for 0002054E00FB and gets the renamed 
0002054E00FB.partial (that is, with the *.partial* suffix stripped).

So, in the end, the *last leader* has a local 0002054E00FB.partial, 
which contains the records needed for recovery, and 0002054E00FB 
(the renamed 0002054E00FB.partial from the new leader). But 
0002054E00FB has a higher priority than 0002054E00FB.partial.
Also, the *new leader* archives a full 0003054E00FB on the new 
timeline, and that file contains the same data as 0002054E00FB.






RE: Unable to start replica after failover

2022-08-26 Thread Lahnov, Igor
I can't answer your question. How do you think recovery from the archive should 
work in this case?
1. the partial file should not be restored at all in this case;
2. the partial file should be restored as a partial and replace the local partial; or
3. the restore command should return a conflict: the file already exists.




RE: Unable to start replica after failover

2022-09-06 Thread Lahnov, Igor
As far as I understand from the logs, the last leader does not yet know about 
the new timeline and tries to download the full log from the previous timeline.
It seems that restoring in this case should raise a conflict, because the 
partial file already exists locally, but this does not happen, and the partial 
file from the new leader is loaded as a full log.
Do you think it would be possible to add a check to the restore command for 
whether a partial or full file already exists locally (see the sketch below)?
Or is disabling the restore command a possible solution in this case?
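For illustration, the kind of check we have in mind is a thin wrapper around 
the restore command (a hypothetical, untested script; the pg_wal path, backup 
path and instance name are placeholders):
---
#!/bin/sh
# restore_wal.sh %f %p  -- hypothetical wrapper used as restore_command
WAL_NAME="$1"       # %f: requested WAL file name
TARGET_PATH="$2"    # %p: path where the server expects the restored file
PG_WAL="/var/lib/pgsql/14/data/pg_wal"   # placeholder data directory

# If the segment is already present locally, complete or as .partial, do not
# fetch the archive copy; a non-zero exit makes recovery fall back to pg_wal.
if [ -f "$PG_WAL/$WAL_NAME" ] || [ -f "$PG_WAL/$WAL_NAME.partial" ]; then
    exit 1
fi

exec pg_probackup archive-get -B /backup --instance node \
     --wal-file-path="$TARGET_PATH" --wal-file-name="$WAL_NAME"
---
with restore_command = '/path/to/restore_wal.sh %f %p'.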


Streaming WAL from primary terminating

2022-09-28 Thread Lahnov, Igor
Hi,
After a failover, none of the standby nodes could start streaming WAL recovery.
Streaming starts from 1473/A500, but the standby starts at 1473/A5FFEE08, 
which seems to be the problem.
What can we do in this case to recover?
Is it possible to shift the WAL streaming start point on the primary?
Could a checkpoint on the primary help in this situation?


2022-09-26 14:08:23.289  [3747796] LOG:  consistent recovery state reached 
at 1473/A5FFEE08
2022-09-26 14:08:23.289  [3747796] LOG:  invalid record length at 
1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:08:23.290  [3747793] LOG:  database system is ready to accept 
read-only connections
2022-09-26 14:08:23.292  [3747867] LOG:  starting bgworker pgsentinel
2022-09-26 14:08:23.672  [3747868] LOG:  started streaming WAL from primary 
at 1473/A500 on timeline 18
2022-09-26 14:08:24.363  [3747796] LOG:  invalid record length at 
1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:08:24.366  [3747868] FATAL:  terminating walreceiver process 
due to administrator command
2022-09-26 14:08:24.366  [3747796] LOG:  invalid record length at 
1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:08:24.366  [3747796] LOG:  invalid record length at 
1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:08:24.366  [3747796] LOG:  invalid record length at 
1473/A5FFEE08: wanted 24, got 0
...
...
...
2022-09-26 14:12:24.055  [3747796] LOG:  invalid record length at 
1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:12:29.060  [3747796] LOG:  invalid record length at 
1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:12:34.064  [3747796] LOG:  invalid record length at 
1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:12:39.068  [3747796] LOG:  invalid record length at 
1473/A5FFEE08: wanted 24, got 0
2022-09-26 14:12:39.910  [3747793] LOG:  received fast shutdown request
2022-09-26 14:12:39.911  [3747793] LOG:  aborting any active transactions
2022-09-26 14:12:39.911  [3747867] LOG:  bgworker pgsentinel signal: 
processed SIGTERM
2022-09-26 14:12:39.914  [3747797] LOG:  shutting down
2022-09-26 14:12:39.922  [3747793] LOG:  database system is shut down


RE: Streaming WAL from primary terminating

2022-09-29 Thread Lahnov, Igor
primary_conninfo is set in postgresql.conf.
The reason is definitely something else, but we do not know what. This is fatal 
in our case.


> 2022-09-26 14:08:24.366 [3747796] LOG: invalid record length at 
> 1473/A5FFEE08: wanted 24, got 0
> 2022-09-26 14:08:24.366 [3747796] LOG: invalid record length at 
> 1473/A5FFEE08: wanted 24, got 0

According to the following WAL record, LSN 1473/A5FFEE08 does not need to be 
read:

rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 1473/A628, prev 
1473/A5FFD070, desc: OVERWRITE_CONTRECORD lsn 1473/A5FFEE08; time 2022-09-26 
11:36:58.104667 MSK
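(For reference, a record like the one above can be located with pg_waldump on 
the primary; the start/end bounds below are hypothetical:)
---
# timeline 18 as in the streaming log above; bounds are placeholders around the record
pg_waldump -p $PGDATA/pg_wal -t 18 -s 1473/A5FF0000 -e 1473/A6000100 | grep -i contrecord
---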






RE: Streaming WAL from primary terminating

2022-09-29 Thread Lahnov, Igor
>> Which version of PostgreSQL are you using?  There has been a few commits 
>> around continuation records lately, so it is hard to say if you are pointing 
>> at a new issue or if this is a past one already addressed.

Hi Michael,
We are using PostgreSQL 14.4.






RE: Streaming WAL from primary terminating

2022-10-05 Thread Lahnov, Igor
Do you need any additional information?
