How does pg_basebackup manage to create a snapshot of the filesystem?

2020-03-19 Thread Dennis Jacobfeuerborn
Hi,
I'm currently trying to understand how backups work. In the
documentation in section "25.2. File System Level Backup" it says that
filesystem level backups can only be made when the database if offline
yet pg_basebackup seems to do just that but works while the database is
online. Am I misunderstanding something here or does pg_basebackup use
some particular features of Postgres to accomplish this?

Regards,
  Dennis




"invalid record length" after restoring pg_basebackup

2020-11-13 Thread Dennis Jacobfeuerborn
Hi,
I've run into a strange issue after restoring a backup that I created
using pg_basebackup on a standby instance. The command I use to create
the backup is this:

pg_basebackup -v --write-recovery-conf -h$BACKUP_HOST -p5432
-U$BACKUP_USER --format tar --wal-method stream --compress=2 -D
"$BACKUP_DIR"

This backup runs fine and produces a "base.tar.gz" and "pg_wal.tar.gz"
file. The server version is 11.7.

To restore the backup on another system I unpack the "base.tar.gz" file
into the data directory and "pg_wal.tar.gz" into the "pg_wal"
sub-directory. I then comment out the "primary_conninfo" directive in
the "recovery.conf" file since I don't want to replicate from the
primary on this system I just want to restore the state from the backup.

All of this works fine and the logs report that the db reaches a
consistent recovery state but as last entry it reports an "invalid
record length":

2020-11-13 12:25:34.266 UTC [1] LOG:  listening on IPv4 address
"0.0.0.0", port 5432
2020-11-13 12:25:34.266 UTC [1] LOG:  listening on IPv6 address "::",
port 5432
2020-11-13 12:25:34.269 UTC [1] LOG:  listening on Unix socket
"/var/run/postgresql/.s.PGSQL.5432"
2020-11-13 12:25:34.437 UTC [20] LOG:  database system was interrupted
while in recovery at log time 2020-11-03 14:36:00 UTC
2020-11-13 12:25:34.437 UTC [20] HINT:  If this has occurred more than
once some data might be corrupted and you might need to choose an
earlier recovery target.
2020-11-13 12:25:35.359 UTC [20] WARNING:  recovery command file
"recovery.conf" specified neither primary_conninfo nor restore_command
2020-11-13 12:25:35.359 UTC [20] HINT:  The database server will
regularly poll the pg_wal subdirectory to check for files placed there.
2020-11-13 12:25:35.359 UTC [20] LOG:  entering standby mode
2020-11-13 12:25:35.372 UTC [20] LOG:  redo starts at 932/9D123E70
2020-11-13 12:26:45.435 UTC [20] LOG:  consistent recovery state reached
at 933/AA63F600
2020-11-13 12:26:45.435 UTC [1] LOG:  database system is ready to accept
read only connections
2020-11-13 12:26:45.444 UTC [20] LOG:  invalid record length at
933/AA6DA660: wanted 24, got 0

Looking at the WAL data the last file is "0001093300AA" and
there is a file "archive_status/0001093300A9.done" but no
corresponding ".done" file for the "...AA" WAL file.

Running pg_waldump on the "...AA" file shows the following at the end:

$ pg_waldump 0001093300AA
...
rmgr: Heaplen (rec/tot): 54/54, tx:1743220, lsn:
933/AA6DA510, prev 933/AA6D8F08, desc: LOCK off 12: xid 1743220: flags 0
LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/13117/874105 blk 0
rmgr: Heaplen (rec/tot): 81/81, tx:1743220, lsn:
933/AA6DA548, prev 933/AA6DA510, desc: INSERT off 19, blkref #0: rel
1663/13117/868296 blk 0
rmgr: Btree   len (rec/tot): 80/80, tx:1743220, lsn:
933/AA6DA5A0, prev 933/AA6DA548, desc: INSERT_LEAF off 24, blkref #0:
rel 1663/13117/868299 blk 1
rmgr: Btree   len (rec/tot): 72/72, tx:1743220, lsn:
933/AA6DA5F0, prev 933/AA6DA5A0, desc: INSERT_LEAF off 23, blkref #0:
rel 1663/13117/915254 blk 2
rmgr: Transaction len (rec/tot): 34/34, tx:1743220, lsn:
933/AA6DA638, prev 933/AA6DA5F0, desc: COMMIT 2020-11-03 14:49:29.205562 UTC
pg_waldump: FATAL:  error in WAL record at 933/AA6DA638: invalid record
length at 933/AA6DA660: wanted 24, got 0

It looks like the WAL in the backup is inconsistent at the end but I'd
expect pg_basebackup to create a clean backup. Is this something I
should expect or is this a problem with pg_basebackup?

Regards,
  Dennis




Re: "invalid record length" after restoring pg_basebackup

2020-11-16 Thread Dennis Jacobfeuerborn
On 11/13/20 4:02 PM, Tom Lane wrote:
> Dennis Jacobfeuerborn  writes:
>> All of this works fine and the logs report that the db reaches a
>> consistent recovery state but as last entry it reports an "invalid
>> record length":
> 
> This looks quite normal to me.  If you'd pulled the power plug on the
> primary system at the time you made this backup, you would likely see
> the same message at the end of its crash recovery.  Some sort of
> corrupt-WAL-entry report is expected at the end of WAL replay anytime
> you didn't have a clean shutdown.

But the system the backup was pulled from kept running fine. Also
wouldn't that make re-attaching the system to the primary impossible
since replication cannot be continued due to the broken WAL record?
What I would expect is that pg_basebackup only transfer healthy WAL
entries so that a restored system can pick up right after that with
streaming replication.

Regards,
 Dennis




Why would Postgres 11 suddenly ignore the recovery.conf file?

2021-10-04 Thread Dennis Jacobfeuerborn

Hi,
I just stopped and restarted a Postgres 11 docker container which so far 
has happily streamed the WAL from its primary but after the restart it 
now seems to completely ignore the recovery.conf file and just outputs this:


 2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv4 address 
"0.0.0.0", port 5432
2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv6 address "::", 
port 5432
2021-10-04 10:14:19.123 UTC [1] LOG:  listening on Unix socket 
"/var/run/postgresql/.s.PGSQL.5432"
2021-10-04 10:14:19.304 UTC [21] LOG:  database system was shut down in 
recovery at 2021-10-04 10:14:11 UTC

2021-10-04 10:14:19.305 UTC [21] LOG:  entering standby mode
2021-10-04 10:14:19.313 UTC [21] LOG:  could not signal for checkpoint: 
checkpointer is not running

2021-10-04 10:14:19.338 UTC [21] LOG:  redo starts at 2F01/B9044C30
2021-10-04 10:15:35.889 UTC [21] LOG:  consistent recovery state reached 
at 2F02/A2E1A108
2021-10-04 10:15:35.893 UTC [1] LOG:  database system is ready to accept 
read only connections


Notice that there is apparently no attempt being made to connect to the 
primary any more. The recovery.conf file is still present and correct so 
I don't understand why Postgres seems to ignore it completely?


Regards,
  Dennis




Re: Why would Postgres 11 suddenly ignore the recovery.conf file?

2021-10-04 Thread Dennis Jacobfeuerborn
That happened with Postgres 12 I think so it shouldn't affect this 
installation.


On 10/4/21 12:40, Josef Šimánek wrote:

Isn't this related to merge into postgresql.conf
(https://www.postgresql.org/docs/current/recovery-config.html)?

po 4. 10. 2021 v 12:35 odesílatel Dennis Jacobfeuerborn
 napsal:


Hi,
I just stopped and restarted a Postgres 11 docker container which so far
has happily streamed the WAL from its primary but after the restart it
now seems to completely ignore the recovery.conf file and just outputs this:

   2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv4 address
"0.0.0.0", port 5432
2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv6 address "::",
port 5432
2021-10-04 10:14:19.123 UTC [1] LOG:  listening on Unix socket
"/var/run/postgresql/.s.PGSQL.5432"
2021-10-04 10:14:19.304 UTC [21] LOG:  database system was shut down in
recovery at 2021-10-04 10:14:11 UTC
2021-10-04 10:14:19.305 UTC [21] LOG:  entering standby mode
2021-10-04 10:14:19.313 UTC [21] LOG:  could not signal for checkpoint:
checkpointer is not running
2021-10-04 10:14:19.338 UTC [21] LOG:  redo starts at 2F01/B9044C30
2021-10-04 10:15:35.889 UTC [21] LOG:  consistent recovery state reached
at 2F02/A2E1A108
2021-10-04 10:15:35.893 UTC [1] LOG:  database system is ready to accept
read only connections

Notice that there is apparently no attempt being made to connect to the
primary any more. The recovery.conf file is still present and correct so
I don't understand why Postgres seems to ignore it completely?

Regards,
Dennis








Re: Why would Postgres 11 suddenly ignore the recovery.conf file?

2021-10-04 Thread Dennis Jacobfeuerborn

On 10/4/21 12:35, Dennis Jacobfeuerborn wrote:

Hi,
I just stopped and restarted a Postgres 11 docker container which so far 
has happily streamed the WAL from its primary but after the restart it 
now seems to completely ignore the recovery.conf file and just outputs 
this:


  2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv4 address 
"0.0.0.0", port 5432
2021-10-04 10:14:19.103 UTC [1] LOG:  listening on IPv6 address "::", 
port 5432
2021-10-04 10:14:19.123 UTC [1] LOG:  listening on Unix socket 
"/var/run/postgresql/.s.PGSQL.5432"
2021-10-04 10:14:19.304 UTC [21] LOG:  database system was shut down in 
recovery at 2021-10-04 10:14:11 UTC

2021-10-04 10:14:19.305 UTC [21] LOG:  entering standby mode
2021-10-04 10:14:19.313 UTC [21] LOG:  could not signal for checkpoint: 
checkpointer is not running

2021-10-04 10:14:19.338 UTC [21] LOG:  redo starts at 2F01/B9044C30
2021-10-04 10:15:35.889 UTC [21] LOG:  consistent recovery state reached 
at 2F02/A2E1A108
2021-10-04 10:15:35.893 UTC [1] LOG:  database system is ready to accept 
read only connections


Notice that there is apparently no attempt being made to connect to the 
primary any more. The recovery.conf file is still present and correct so 
I don't understand why Postgres seems to ignore it completely?


Ok, so just as I sent the mail the log now showed this:

2021-10-04 10:37:53.841 UTC [45] LOG:  started streaming WAL from 
primary at 2F14/1300 on timeline 1


Notice that it took over 20 minutes for the system to start streaming. 
Could this be a connection related issue? If so is there a way to see 
that Postgres is trying to connect to the primary but hasn't succeeded 
yet? From the looks of it the logs stays silent until is actually has 
started streaming and there is no "initiating recovery" message before 
the connection actually succeeds.


Regards,
  Dennis