How does pg_basebackup manage to create a snapshot of the filesystem?
Hi, I'm currently trying to understand how backups work. In the documentation in section "25.2. File System Level Backup" it says that filesystem level backups can only be made when the database if offline yet pg_basebackup seems to do just that but works while the database is online. Am I misunderstanding something here or does pg_basebackup use some particular features of Postgres to accomplish this? Regards, Dennis
"invalid record length" after restoring pg_basebackup
Hi, I've run into a strange issue after restoring a backup that I created using pg_basebackup on a standby instance. The command I use to create the backup is this: pg_basebackup -v --write-recovery-conf -h$BACKUP_HOST -p5432 -U$BACKUP_USER --format tar --wal-method stream --compress=2 -D "$BACKUP_DIR" This backup runs fine and produces a "base.tar.gz" and "pg_wal.tar.gz" file. The server version is 11.7. To restore the backup on another system I unpack the "base.tar.gz" file into the data directory and "pg_wal.tar.gz" into the "pg_wal" sub-directory. I then comment out the "primary_conninfo" directive in the "recovery.conf" file since I don't want to replicate from the primary on this system I just want to restore the state from the backup. All of this works fine and the logs report that the db reaches a consistent recovery state but as last entry it reports an "invalid record length": 2020-11-13 12:25:34.266 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432 2020-11-13 12:25:34.266 UTC [1] LOG: listening on IPv6 address "::", port 5432 2020-11-13 12:25:34.269 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2020-11-13 12:25:34.437 UTC [20] LOG: database system was interrupted while in recovery at log time 2020-11-03 14:36:00 UTC 2020-11-13 12:25:34.437 UTC [20] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2020-11-13 12:25:35.359 UTC [20] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command 2020-11-13 12:25:35.359 UTC [20] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there. 2020-11-13 12:25:35.359 UTC [20] LOG: entering standby mode 2020-11-13 12:25:35.372 UTC [20] LOG: redo starts at 932/9D123E70 2020-11-13 12:26:45.435 UTC [20] LOG: consistent recovery state reached at 933/AA63F600 2020-11-13 12:26:45.435 UTC [1] LOG: database system is ready to accept read only connections 2020-11-13 12:26:45.444 UTC [20] LOG: invalid record length at 933/AA6DA660: wanted 24, got 0 Looking at the WAL data the last file is "0001093300AA" and there is a file "archive_status/0001093300A9.done" but no corresponding ".done" file for the "...AA" WAL file. Running pg_waldump on the "...AA" file shows the following at the end: $ pg_waldump 0001093300AA ... rmgr: Heaplen (rec/tot): 54/54, tx:1743220, lsn: 933/AA6DA510, prev 933/AA6D8F08, desc: LOCK off 12: xid 1743220: flags 0 LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/13117/874105 blk 0 rmgr: Heaplen (rec/tot): 81/81, tx:1743220, lsn: 933/AA6DA548, prev 933/AA6DA510, desc: INSERT off 19, blkref #0: rel 1663/13117/868296 blk 0 rmgr: Btree len (rec/tot): 80/80, tx:1743220, lsn: 933/AA6DA5A0, prev 933/AA6DA548, desc: INSERT_LEAF off 24, blkref #0: rel 1663/13117/868299 blk 1 rmgr: Btree len (rec/tot): 72/72, tx:1743220, lsn: 933/AA6DA5F0, prev 933/AA6DA5A0, desc: INSERT_LEAF off 23, blkref #0: rel 1663/13117/915254 blk 2 rmgr: Transaction len (rec/tot): 34/34, tx:1743220, lsn: 933/AA6DA638, prev 933/AA6DA5F0, desc: COMMIT 2020-11-03 14:49:29.205562 UTC pg_waldump: FATAL: error in WAL record at 933/AA6DA638: invalid record length at 933/AA6DA660: wanted 24, got 0 It looks like the WAL in the backup is inconsistent at the end but I'd expect pg_basebackup to create a clean backup. Is this something I should expect or is this a problem with pg_basebackup? Regards, Dennis
Re: "invalid record length" after restoring pg_basebackup
On 11/13/20 4:02 PM, Tom Lane wrote: > Dennis Jacobfeuerborn writes: >> All of this works fine and the logs report that the db reaches a >> consistent recovery state but as last entry it reports an "invalid >> record length": > > This looks quite normal to me. If you'd pulled the power plug on the > primary system at the time you made this backup, you would likely see > the same message at the end of its crash recovery. Some sort of > corrupt-WAL-entry report is expected at the end of WAL replay anytime > you didn't have a clean shutdown. But the system the backup was pulled from kept running fine. Also wouldn't that make re-attaching the system to the primary impossible since replication cannot be continued due to the broken WAL record? What I would expect is that pg_basebackup only transfer healthy WAL entries so that a restored system can pick up right after that with streaming replication. Regards, Dennis
Why would Postgres 11 suddenly ignore the recovery.conf file?
Hi, I just stopped and restarted a Postgres 11 docker container which so far has happily streamed the WAL from its primary but after the restart it now seems to completely ignore the recovery.conf file and just outputs this: 2021-10-04 10:14:19.103 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-10-04 10:14:19.103 UTC [1] LOG: listening on IPv6 address "::", port 5432 2021-10-04 10:14:19.123 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2021-10-04 10:14:19.304 UTC [21] LOG: database system was shut down in recovery at 2021-10-04 10:14:11 UTC 2021-10-04 10:14:19.305 UTC [21] LOG: entering standby mode 2021-10-04 10:14:19.313 UTC [21] LOG: could not signal for checkpoint: checkpointer is not running 2021-10-04 10:14:19.338 UTC [21] LOG: redo starts at 2F01/B9044C30 2021-10-04 10:15:35.889 UTC [21] LOG: consistent recovery state reached at 2F02/A2E1A108 2021-10-04 10:15:35.893 UTC [1] LOG: database system is ready to accept read only connections Notice that there is apparently no attempt being made to connect to the primary any more. The recovery.conf file is still present and correct so I don't understand why Postgres seems to ignore it completely? Regards, Dennis
Re: Why would Postgres 11 suddenly ignore the recovery.conf file?
That happened with Postgres 12 I think so it shouldn't affect this installation. On 10/4/21 12:40, Josef Šimánek wrote: Isn't this related to merge into postgresql.conf (https://www.postgresql.org/docs/current/recovery-config.html)? po 4. 10. 2021 v 12:35 odesílatel Dennis Jacobfeuerborn napsal: Hi, I just stopped and restarted a Postgres 11 docker container which so far has happily streamed the WAL from its primary but after the restart it now seems to completely ignore the recovery.conf file and just outputs this: 2021-10-04 10:14:19.103 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-10-04 10:14:19.103 UTC [1] LOG: listening on IPv6 address "::", port 5432 2021-10-04 10:14:19.123 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2021-10-04 10:14:19.304 UTC [21] LOG: database system was shut down in recovery at 2021-10-04 10:14:11 UTC 2021-10-04 10:14:19.305 UTC [21] LOG: entering standby mode 2021-10-04 10:14:19.313 UTC [21] LOG: could not signal for checkpoint: checkpointer is not running 2021-10-04 10:14:19.338 UTC [21] LOG: redo starts at 2F01/B9044C30 2021-10-04 10:15:35.889 UTC [21] LOG: consistent recovery state reached at 2F02/A2E1A108 2021-10-04 10:15:35.893 UTC [1] LOG: database system is ready to accept read only connections Notice that there is apparently no attempt being made to connect to the primary any more. The recovery.conf file is still present and correct so I don't understand why Postgres seems to ignore it completely? Regards, Dennis
Re: Why would Postgres 11 suddenly ignore the recovery.conf file?
On 10/4/21 12:35, Dennis Jacobfeuerborn wrote: Hi, I just stopped and restarted a Postgres 11 docker container which so far has happily streamed the WAL from its primary but after the restart it now seems to completely ignore the recovery.conf file and just outputs this: 2021-10-04 10:14:19.103 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-10-04 10:14:19.103 UTC [1] LOG: listening on IPv6 address "::", port 5432 2021-10-04 10:14:19.123 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2021-10-04 10:14:19.304 UTC [21] LOG: database system was shut down in recovery at 2021-10-04 10:14:11 UTC 2021-10-04 10:14:19.305 UTC [21] LOG: entering standby mode 2021-10-04 10:14:19.313 UTC [21] LOG: could not signal for checkpoint: checkpointer is not running 2021-10-04 10:14:19.338 UTC [21] LOG: redo starts at 2F01/B9044C30 2021-10-04 10:15:35.889 UTC [21] LOG: consistent recovery state reached at 2F02/A2E1A108 2021-10-04 10:15:35.893 UTC [1] LOG: database system is ready to accept read only connections Notice that there is apparently no attempt being made to connect to the primary any more. The recovery.conf file is still present and correct so I don't understand why Postgres seems to ignore it completely? Ok, so just as I sent the mail the log now showed this: 2021-10-04 10:37:53.841 UTC [45] LOG: started streaming WAL from primary at 2F14/1300 on timeline 1 Notice that it took over 20 minutes for the system to start streaming. Could this be a connection related issue? If so is there a way to see that Postgres is trying to connect to the primary but hasn't succeeded yet? From the looks of it the logs stays silent until is actually has started streaming and there is no "initiating recovery" message before the connection actually succeeds. Regards, Dennis