WAL File Recovery on Standby Server Stops Before End of WAL Files

2021-10-27 Thread Ryan, Les
Hello,

I'm hoping to get some suggestions on what to do here.  I am running PostgreSQL 
version 13.2 and am shipping the WAL files to a standby server.  Once a day I 
restart the standby server and it recovers the new WAL files that have been 
shipped to it.  Everything was working great until yesterday.  My problem is 
that the WAL recovery is now stopping before it recovers all of the available 
WAL files.  This happened once before and the only way I could get the WAL 
recovery to go past that file was to create a fresh back and restore that.  I'm 
hoping to avoid that as it takes about a week to create the backup.

Here are the specifics:

  *   PostgreSQL version 13.2
  *   The primary server creates the WAL files and a scheduled process copies 
them to a folder on the standby server.
  *   Once a day, the standby server is restared using the following command:
 *   "C:\Program Files\PostgreSQL\13\bin\pg_ctl" restart -D .\
  *   The log contains the following:
2021-10-27 10:26:30.508 MDT [6204] LOG:  starting PostgreSQL 13.2, compiled by 
Visual C++ build 1914, 64-bit
2021-10-27 10:26:30.509 MDT [6204] LOG:  listening on IPv6 address "::", port 
5432
2021-10-27 10:26:30.510 MDT [6204] LOG:  listening on IPv4 address "0.0.0.0", 
port 5432
2021-10-27 10:26:30.585 MDT [2012] LOG:  database system was shut down in 
recovery at 2021-10-27 10:26:29 MDT
2021-10-27 10:26:30.701 MDT [2012] LOG:  entering standby mode
2021-10-27 10:26:30.821 MDT [2012] LOG:  restored log file 
"000104190056" from archive
2021-10-27 10:26:31.158 MDT [2012] LOG:  restored log file 
"000104190052" from archive
2021-10-27 10:26:31.467 MDT [2012] LOG:  redo starts at 419/5229A858
2021-10-27 10:26:31.561 MDT [2012] LOG:  restored log file 
"000104190053" from archive
2021-10-27 10:26:32.108 MDT [2012] LOG:  restored log file 
"000104190054" from archive
2021-10-27 10:26:32.849 MDT [2012] LOG:  restored log file 
"000104190055" from archive
2021-10-27 10:26:33.612 MDT [2012] LOG:  restored log file 
"000104190056" from archive
2021-10-27 10:26:34.342 MDT [2012] LOG:  restored log file 
"000104190057" from archive
2021-10-27 10:26:35.146 MDT [2012] LOG:  restored log file 
"000104190058" from archive
2021-10-27 10:26:35.718 MDT [2012] LOG:  restored log file 
"000104190059" from archive
2021-10-27 10:26:36.188 MDT [2012] LOG:  restored log file 
"00010419005A" from archive
2021-10-27 10:26:36.750 MDT [2012] LOG:  consistent recovery state reached at 
419/5AB8
2021-10-27 10:26:36.752 MDT [6204] LOG:  database system is ready to accept 
read only connections
2021-10-27 10:26:36.823 MDT [6040] LOG:  started streaming WAL from primary at 
419/5A00 on timeline 1

  *   There are many more WAL files available starting with 
00010419005B but the restore process always stops at 
00010419005A.

I have two questions:

  *   Why does the WAL file recovery process now stop after it reads 
00010419005A?
  *   What do I need to do to get PostgreSQL to recover the rest of the 
available WAL files.

Thanks in advance for any suggestions.

Sincerely,
-Les


Les Ryan, P.Eng | WSP

SCADA Engineer
Energy, Resources & Industry

T +1 403-813-6327
E les.r...@wsp.com
O 405 18 St SE. Calgary, Alberta T2E 
6J5





NOTICE: This communication and any attachments ("this message") may contain 
information which is privileged, confidential, proprietary or otherwise subject 
to restricted disclosure under applicable law. This message is for the sole use 
of the intended recipient(s). Any unauthorized use, disclosure, viewing, 
copying, alteration, dissemination or distribution of, or reliance on, this 
message is strictly prohibited. If you have received this message in error, or 
you are not an authorized or intended recipient, please notify the sender 
immediately by replying to this message, delete this message and all copies 
from your e-mail system and destroy any printed copies. You are receiving this 
communication because you are listed as a current WSP contact. Should you have 
any questions regarding WSP's electronic communications policy, please consult 
our Anti-Spam Commitment at www.wsp.com/casl. For any 
concern or if you believe you should not be receiving this message, please 
forward this message to caslcomplia...@wsp.com 
so that we can promptly address your request. Note that not all messages sent 
by WSP qualify as commercial electronic messages.

AVIS : Ce message, incluant tout fichier l'accompagnant (< le message >), peut 
contenir des renseignements ou de l'information privil?gi?s, confidentiels, 
propri?taires ou ? divulgation restreinte en vertu de la loi. Ce message est 
destin? ? l'usage exclu

RE: WAL File Recovery on Standby Server Stops Before End of WAL Files

2021-10-28 Thread Ryan, Les
Hi Kyotaro and Dilip,

Thank you for getting back to me.

Kyotaro: I ran pg_dump and the output was "pg_waldump: fatal: could not read 
file "00010419005A": read 50 of 8192".  I'm guessing that it means 
that wal file 00010419005A is corrupted and that is why the 
recovery process stops there.  Is there any way to fix the file?

Dilip:   setting the log level to debug2 did not provide any additional 
information.  Here are the log entries:

2021-10-28 06:51:06.166 MDT [7556] LOG:  restored log file 
"000104190059" from archive
2021-10-28 06:51:06.464 MDT [7556] DEBUG:  got WAL segment from archive
2021-10-28 06:51:06.579 MDT [7556] LOG:  restored log file 
"00010419005A" from archive
2021-10-28 06:51:06.854 MDT [7556] DEBUG:  got WAL segment from archive
2021-10-28 06:51:07.107 MDT [7556] LOG:  consistent recovery state reached at 
419/5AB8
2021-10-28 06:51:07.107 MDT [7556] DEBUG:  switched WAL source from archive to 
stream after failure
2021-10-28 06:51:07.109 MDT [7844] LOG:  database system is ready to accept 
read only connections
2021-10-28 06:51:07.152 MDT [7844] DEBUG:  forked new backend, pid=6900 
socket=6068

I set the log level to debug5 and here is what I got:

2021-10-28 06:25:41.262 MDT [6288] CONTEXT:  WAL redo at 419/5ABFFF60 for 
Btree/INSERT_LEAF: off 130
2021-10-28 06:25:41.262 MDT [6288] DEBUG:  record known xact 33776257 
latestObservedXid 33776257
2021-10-28 06:25:41.262 MDT [6288] CONTEXT:  WAL redo at 419/5ABFFFA0 for 
Heap/INSERT: off 95 flags 0x00
2021-10-28 06:25:41.262 MDT [6288] LOG:  consistent recovery state reached at 
419/5AB8
2021-10-28 06:25:41.263 MDT [6288] DEBUG:  switched WAL source from archive to 
stream after failure
2021-10-28 06:25:41.264 MDT [5512] LOG:  database system is ready to accept 
read only connections

Does the "switched WAL source from archive to stream after failure" indicate a 
problem with the WAL file?

Anyway, it looks like I need to restore the standby server from a new backup.  
Thank you both for your help.

Sincerely,
-Les

-Original Message-
From: Dilip Kumar 
Sent: October 27, 2021 10:29 PM
To: Kyotaro Horiguchi 
Cc: Ryan, Les ; pgsql-generallists.postgresql.org 

Subject: Re: WAL File Recovery on Standby Server Stops Before End of WAL Files

On Thu, Oct 28, 2021 at 7:28 AM Kyotaro Horiguchi  
wrote:
>
> At Wed, 27 Oct 2021 16:42:52 +, "Ryan, Les" 
> wrote in
> > 2021-10-27 10:26:31.467 MDT [2012] LOG:  redo starts at 419/5229A858
> ...
> > 2021-10-27 10:26:36.188 MDT [2012] LOG:  restored log file
> > "00010419005A" from archive
> > 2021-10-27 10:26:36.750 MDT [2012] LOG:  consistent recovery state
> > reached at 419/5AB8
> > 2021-10-27 10:26:36.752 MDT [6204] LOG:  database system is ready to
> > accept read only connections
> > 2021-10-27 10:26:36.823 MDT [6040] LOG:  started streaming WAL from
> > primary at 419/5A00 on timeline 1
> >
> >   *   There are many more WAL files available starting with 
> > 00010419005B but the restore process always stops at 
> > 00010419005A.
> >
> > I have two questions:
> >
> >   *   Why does the WAL file recovery process now stop after it reads 
> > 00010419005A?
> >   *   What do I need to do to get PostgreSQL to recover the rest of the 
> > available WAL files.
>
> The info above alone donesn't clearly suggest anything about the
> reason. Could you show the result of "pg_waldump
> 00010419005A 2>&1 | tail -5"?  What I'm expecting to see
> is an error message from pg_waldump before the end of the file. It
> would be the immediate cause of the problem.

+1, that will be the best place to start with, additionally, you can
enable DEBUG2 message so that from logs we can identify why it could not 
continue recovery from the archive.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




NOTICE: This communication and any attachments ("this message") may contain 
information which is privileged, confidential, proprietary or otherwise subject 
to restricted disclosure under applicable law. This message is for the sole use 
of the intended recipient(s). Any unauthorized use, disclosure, viewing, 
copying, alteration, dissemination or distribution of, or reliance on, this 
message is strictly prohibited. If you have received this message in error, or 
you are not an authorized or intended recipient, please notify the sender 
immediately by replying to this message, delete this message and all copies 
from your e-mail system and destroy any printed copies. You are receiving this 
communication because you are listed as a current WSP contact. Should you have 
any quest