Re: psql: FATAL: the database system is starting up
On Tue, May 28, 2019 at 9:53 AM Adrian Klaver wrote:
> On 5/27/19 9:59 PM, Tom K wrote:
> > Hey Guys,
> >
> > I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
> > RHEL 7.6 VMs. Every now and then the underlying storage crashes, taking
> > out the cluster. On recovery, PostgreSQL tends to come up while other
> > databases just blow up. That is, until now.
>
> The above is one master and two standby servers connected via streaming
> replication, correct?
> Is there WAL archiving set up?

Correct. Master election occurs through Patroni. WAL level is set to:

wal_level = 'replica'

So no archiving.

> > After the most recent crash 2-3 weeks ago, the cluster is now running
> > into this message but I'm not able to make heads or tails out of why
> > it's throwing this:
>
> So you have not been able to run the cluster the past 2-3 weeks or is
> that more recent?

Haven't been able to bring this PostgreSQL cluster up (run the cluster) since 2-3 weeks ago. Tried quite a few combinations of options to recover it. No luck. Had storage failures earlier, even with corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to come up each time without any recovery effort on my part.

> When you refer to history files below, are you talking about WAL files or
> something else?
>
> Is this:
>
> "recovery command file "recovery.conf" specified neither
> primary_conninfo nor restore_command"
>
> true?

True. recovery.conf is controlled by Patroni. The contents of this file are the same on all the cluster nodes, with the exception of primary_slot_name:

[root@psql01 postgresql-patroni-etcd]# cat recovery.conf
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql01 postgresql-patroni-etcd]#

[root@psql02 postgres-backup]# cat recovery.conf
primary_slot_name = 'postgresql1'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql02 postgres-backup]#

[root@psql03 postgresql-patroni-backup]# cat recovery.conf
primary_slot_name = 'postgresql2'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql03 postgresql-patroni-backup]#

I've made a copy of the root postgres directory over to another location so that, when troubleshooting, I can always revert to the first state the cluster was in when it failed.
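For reference, a minimal sketch of taking that revert copy safely (assuming Patroni runs under systemd on these hosts and is stopped first so it cannot restart postgres mid-copy; paths match the ones used in this thread):

# Run as root on each node, with the cluster down.
systemctl stop patroni
rsync -a /data/patroni/ /root/postgresql-patroni-etcd/
# To roll a failed experiment back:
rsync -a --delete /root/postgresql-patroni-etcd/ /data/patroni/
chown -R postgres:postgres /data/patroni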
Thx,
TK

> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
> > 2019-05-23 08:26:34.561 EDT [10101] LOG: listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:26:34.565 EDT [10101] LOG: listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:26:34.591 EDT [10101] LOG: redirecting log output to logging collector process
> > 2019-05-23 08:26:34.591 EDT [10101] HINT: Future log output will appear in directory "log".
> > ^C
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > 2019-05-23 08:26:45.707 EDT [10105] LOG: listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:26:45.715 EDT [10105] LOG: listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:26:45.740 EDT [10105] LOG: redirecting log output to logging collector process
> > 2019-05-23 08:26:45.740 EDT [10105] HINT: Future log output will appear in directory "log".
> > ^C
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.1 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > 2019-05-23 08:29:36.014 EDT [10188] LOG: listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:29:36.017 EDT [10188] LOG: listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:29:36.040 EDT [10188] LOG: redirecting log output to logging collector process
> > 2019-05-23 08:29:36.040 EDT [10188] HINT: Future log output
Re: psql: FATAL: the database system is starting up
On Wed, May 29, 2019 at 10:28 AM Adrian Klaver wrote:
> On 5/28/19 6:59 PM, Tom K wrote:
> > On Tue, May 28, 2019 at 9:53 AM Adrian Klaver wrote:
> >
> > Correct. Master election occurs through Patroni. WAL level is set to:
> >
> > wal_level = 'replica'
> >
> > So no archiving.
> >
> > > > After the most recent crash 2-3 weeks ago, the cluster is now
> > > > running into this message but I'm not able to make heads or tails
> > > > out of why it's throwing this:
> > >
> > > So you have not been able to run the cluster the past 2-3 weeks or is
> > > that more recent?
> >
> > Haven't been able to bring this PostgreSQL cluster up (run the cluster)
> > since 2-3 weeks ago. Tried quite a few combinations of options to
> > recover this. No luck. Had storage failures earlier, even with
> > corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to
> > come up each time without any recovery effort on my part.
> >
> > > When you refer to history files below are you talking about WAL
> > > files or something else?
> > >
> > > Is this:
> > >
> > > "recovery command file "recovery.conf" specified neither
> > > primary_conninfo nor restore_command"
> > >
> > > true?
> >
> > True. recovery.conf is controlled by Patroni. Contents of this file
> > remained the same for all the cluster nodes with the exception of the
> > primary_slot_name:
> >
> > [root@psql01 postgresql-patroni-etcd]# cat recovery.conf
> > primary_slot_name = 'postgresql0'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql01 postgresql-patroni-etcd]#
> > ...
> >
> > I've made a copy of the root postgres directory over to another
> > location so when troubleshooting, I can always revert to the first
> > state the cluster was in when it failed.
>
> I have no experience with Patroni so I will be of no help there. You
> might get more useful information from:
>
> https://github.com/zalando/patroni
>
> Community
>
> There are two places to connect with the Patroni community: on github,
> via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
> you're using Patroni, or just interested, please join us.

Will post there as well. Thank you. My thinking was to post here first since I suspect the Patroni community will simply refer me back here, given that the errors are originating directly from PostgreSQL.

> That being said, can you start the copied Postgres instance without
> using the Patroni instrumentation?

Yes, that is something I have been trying to do actually. But I hit a dead end with the three errors above.
So what I did is to copy a single node's backed-up data files to */data/patroni* on the same node (this is the psql data directory as defined through Patroni), then ran this (psql03 = 192.168.0.118):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.118 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

This resulted in one of the 3 messages above. Hence the post here. If I can start a single instance, I should be fine, since I could then 1) replicate over to the other two or 2) simply take a dump, reinitialize all the databases, then restore the dump.

Using the above procedure I get one of three error messages, depending on which node's data files I use:

[ PSQL01 ]
postgres: postgres: startup process waiting for 00010008

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL: syntax error in history file: f2W

And I can't start any one of them.

> > Thx,
> > TK
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
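Before attempting more starts, a quick sketch of checking what state each copy believes it is in; pg_controldata reads the control file directly and needs no running server:

# Run as postgres against the copied data directory.
$ /usr/pgsql-10/bin/pg_controldata /data/patroni
# "Database cluster state", the latest checkpoint location and the timeline ID
# in the output help tell a cleanly shut down copy from one caught mid-write.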
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 9:55 AM Adrian Klaver wrote:
> On 5/31/19 7:53 PM, Tom K wrote:
> > > There are two places to connect with the Patroni community: on github,
> > > via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
> > > you're using Patroni, or just interested, please join us.
> >
> > Will post there as well. Thank you. My thinking was to post here first
> > since I suspect the Patroni community will simply refer me back here
> > given that the PostgreSQL errors are originating directly from PostgreSQL.
> >
> > > That being said, can you start the copied Postgres instance without
> > > using the Patroni instrumentation?
> >
> > Yes, that is something I have been trying to do actually. But I hit a
> > dead end with the three errors above.
> >
> > So what I did is to copy a single node's backed up copy of the data
> > files to */data/patroni* of the same node (this is the psql data
> > directory as defined through patroni), then ran this (psql03 = 192.168.0.118):
> >
> > # sudo su - postgres
> > $ /usr/pgsql-10/bin/postgres -D /data/patroni
> > --config-file=/data/patroni/postgresql.conf
> > --listen_addresses=192.168.0.118 --max_worker_processes=8
> > --max_locks_per_transaction=64 --wal_level=replica
> > --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
> > --max_replication_slots=10 --max_connections=100 --hot_standby=on
> > --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
>
> Why all the options?
> That should be covered in postgresql.conf, no?
>
> > This resulted in one of the 3 messages above. Hence the post here. If
> > I can start a single instance, I should be fine since I could then 1)
> > replicate over to the other two or 2) simply take a dump, reinitialize
> > all the databases then restore the dump.
>
> What if you move the recovery.conf file out?

Will try.

> The below looks like missing/corrupted/incorrect files. Hard to tell
> without knowing what Patroni did?

Storage disappeared from underneath these clusters. The OS was of course still in memory, making futile attempts to write to disk, which would never complete.

My best guess is that Patroni or postgres was in the middle of some writes across the clusters when the failure occurred.

> > Using the above procedure I get one of three error messages when using
> > the data files of each node:
> >
> > [ PSQL01 ]
> > postgres: postgres: startup process waiting for 00010008
> >
> > [ PSQL02 ]
> > PANIC: replication checkpoint has wrong magic 0 instead of 307747550
> >
> > [ PSQL03 ]
> > FATAL: syntax error in history file: f2W
> >
> > And I can't start any one of them.
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
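A sketch of the step being suggested above, for anyone following along (with the server stopped, so nothing re-reads the file mid-move):

$ mv /data/patroni/recovery.conf /var/lib/pgsql/recovery.conf.out
$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
# Without recovery.conf, postgres skips standby recovery entirely and
# performs ordinary crash recovery from its own WAL instead.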
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 3:32 PM Tom K wrote:
> On Sat, Jun 1, 2019 at 9:55 AM Adrian Klaver wrote:
>> On 5/31/19 7:53 PM, Tom K wrote:
>> > Will post there as well. Thank you. My thinking was to post here first
>> > since I suspect the Patroni community will simply refer me back here
>> > given that the PostgreSQL errors are originating directly from PostgreSQL.
>> >
>> > > That being said, can you start the copied Postgres instance without
>> > > using the Patroni instrumentation?
>> >
>> > Yes, that is something I have been trying to do actually. But I hit a
>> > dead end with the three errors above.
>> >
>> > So what I did is to copy a single node's backed up copy of the data
>> > files to */data/patroni* of the same node (this is the psql data
>> > directory as defined through patroni), then ran this (psql03 = 192.168.0.118):
>> >
>> > # sudo su - postgres
>> > $ /usr/pgsql-10/bin/postgres -D /data/patroni
>> > --config-file=/data/patroni/postgresql.conf
>> > --listen_addresses=192.168.0.118 --max_worker_processes=8
>> > --max_locks_per_transaction=64 --wal_level=replica
>> > --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
>> > --max_replication_slots=10 --max_connections=100 --hot_standby=on
>> > --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
>>
>> Why all the options?
>> That should be covered in postgresql.conf, no?
>>
>> > This resulted in one of the 3 messages above. Hence the post here. If
>> > I can start a single instance, I should be fine since I could then 1)
>> > replicate over to the other two or 2) simply take a dump, reinitialize
>> > all the databases then restore the dump.
>>
>> What if you move the recovery.conf file out?
>
> Will try.
>
>> The below looks like missing/corrupted/incorrect files. Hard to tell
>> without knowing what Patroni did?
>
> Storage disappeared from underneath these clusters. The OS was of course
> still in memory, making futile attempts to write to disk, which would
> never complete.
>
> My best guess is that Patroni or postgres was in the middle of some
> writes across the clusters when the failure occurred.

Of note are the characters f2W below. I see nothing in the postgres source code to indicate this is any recognizable postgres message. A part of me suspects that the postgres binaries got corrupted. Had this case occur with glibc-common and a reinstall fixed it. However, the postgres binaries' checksums match a standalone install perfectly, so that should not be an issue.

>> > Using the above procedure I get one of three error messages when using
>> > the data files of each node:
>> >
>> > [ PSQL01 ]
>> > postgres: postgres: startup process waiting for 00010008
>> >
>> > [ PSQL02 ]
>> > PANIC: replication checkpoint has wrong magic 0 instead of 307747550
>> >
>> > [ PSQL03 ]
>> > FATAL: syntax error in history file: f2W
>> >
>> > And I can't start any one of them.
>>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com
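On RHEL, the checksum comparison mentioned above can also be done against the package database; a sketch using rpm's verify mode:

$ rpm -V postgresql10-server postgresql10-libs
# Silence means every installed file matches the package metadata;
# a "5" in any output line flags a digest mismatch, i.e. a modified binary.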
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 4:11 PM Adrian Klaver wrote:
> On 6/1/19 12:32 PM, Tom K wrote:
> > > What if you move the recovery.conf file out?
> >
> > Will try.
> >
> > > The below looks like missing/corrupted/incorrect files. Hard to tell
> > > without knowing what Patroni did?
> >
> > Storage disappeared from underneath these clusters. The OS was of
> > course still in memory, making futile attempts to write to disk, which
> > would never complete.
> >
> > My best guess is that Patroni or postgres was in the middle of some
> > writes across the clusters when the failure occurred.
>
> So to be clear, all three clusters were writing to the same storage
> medium and there was no WAL archiving to some other storage?

Yep, cheap LAB hardware with no power redundancy (yet).

> --
> Adrian Klaver
> adrian.kla...@aklaver.com
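For what it's worth, archiving WAL to storage that can fail independently would have given a restore path here; a minimal sketch for postgresql.conf (the /backup/wal mount is hypothetical, and enabling archive_mode requires a restart):

wal_level = 'replica'
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'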
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 4:52 PM Adrian Klaver wrote:
> On 6/1/19 12:42 PM, Tom K wrote:
> > Of note are the characters f2W below. I see nothing in the postgres
> > source code to indicate this is any recognizable postgres message. A
> > part of me suspects that the postgres binaries got corrupted. Had this
> > case occur with glibc-common and a reinstall fixed it. However the
> > postgres binaries' checksums match a standalone install perfectly so
> > that should not be an issue.
>
> It comes from timeline.c:
>
> https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html
>
> pg_log_error("syntax error in history file: %s", fline);
>
> ...
>
> There should be another error message after the above.

Nope. Here's the full set of lines in the postgres logs when running the above line:

2019-06-01 17:13:03.261 EDT [14909] LOG: database system was shut down at 2019-05-22 01:55:13 EDT
2019-06-01 17:13:03.261 EDT [14909] DEBUG: primary_slot_name = 'postgresql2'
2019-06-01 17:13:03.261 EDT [14909] DEBUG: standby_mode = 'on'
2019-06-01 17:13:03.261 EDT [14909] DEBUG: recovery_target_timeline = latest
2019-06-01 17:13:03.261 EDT [14909] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 17:13:03.261 EDT [14909] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 17:13:03.263 EDT [14909] LOG: entering standby mode
2019-06-01 17:13:03.263 EDT [14909] FATAL: syntax error in history file: f2W
2019-06-01 17:13:03.263 EDT [14909] HINT: Expected a numeric timeline ID.
2019-06-01 17:13:03.263 EDT [14909] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG: shmem_exit(1): 4 on_shmem_exit callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG: proc_exit(1): 2 callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG: exit(1)
2019-06-01 17:13:03.264 EDT [14909] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG: proc_exit(-1): 0 callbacks to make
2019-06-01 17:13:03.265 EDT [14907] DEBUG: reaping dead processes
2019-06-01 17:13:03.265 EDT [14907] LOG: startup process (PID 14909) exited with exit code 1
2019-06-01 17:13:03.265 EDT [14907] LOG: aborting startup due to startup process failure
2019-06-01 17:13:03.265 EDT [14907] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG: shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG: cleaning up dynamic shared memory control segment with ID 1669075945
2019-06-01 17:13:03.268 EDT [14907] DEBUG: proc_exit(1): 2 callbacks to make
2019-06-01 17:13:03.268 EDT [14907] LOG: database system is shut down
2019-06-01 17:13:03.268 EDT [14907] DEBUG: exit(1)
2019-06-01 17:13:03.268 EDT [14907] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG: proc_exit(-1): 0 callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: logger shutting down
2019-06-01 17:13:03.270 EDT [14908] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: proc_exit(0): 0 callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: exit(0)
2019-06-01 17:13:03.270 EDT [14908] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: proc_exit(-1): 0 callbacks to make
^C
-bash-4.2$

What's interesting is that f2W isn't a string you'd expect to be printed as part of the code logic (I mean, what is f2W?).

postgresql10-server-10.8-1PGDG.rhel7.x86_64
postgresql10-libs-10.8-1PGDG.rhel7.x86_64
postgresql10-contrib-10.8-1PGDG.rhel7.x86_64
postgresql10-10.8-1PGDG.rhel7.x86_64

Without the recovery.conf file it actually started up. Going to try to get a dump, then continue to try and recover in place.

2019-06-01 17:17:29.777 EDT [14936] DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.777 EDT [14936] DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.778 EDT [14937] DEBUG: received inquiry for database 0
2019-06-01 17:17:29.778 EDT [14937] DEBUG: writing stats file "pg_stat_tmp/global.stat"
2019-06-01 17:17:29.779 EDT [14937] DEBUG: writing
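The "history file" in that FATAL is a timeline history file under pg_wal/; a sketch of inspecting it directly (the exact filename, e.g. 00000002.history, depends on the timeline):

$ ls -l /data/patroni/pg_wal/*.history
$ cat /data/patroni/pg_wal/*.history
# Each line should read: <timelineId> <switchpoint LSN> <reason>.
# A non-numeric first field, like the "f2W" above, means the file is damaged.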
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 5:30 PM Tom K wrote:
> On Sat, Jun 1, 2019 at 4:52 PM Adrian Klaver wrote:
>> On 6/1/19 12:42 PM, Tom K wrote:
>> > Of note are the characters f2W below. I see nothing in the postgres
>> > source code to indicate this is any recognizable postgres message. A
>> > part of me suspects that the postgres binaries got corrupted. Had this
>> > case occur with glibc-common and a reinstall fixed it. However the
>> > postgres binaries' checksums match a standalone install perfectly so
>> > that should not be an issue.
>>
>> It comes from timeline.c:
>>
>> https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html
>>
>> pg_log_error("syntax error in history file: %s", fline);
>>
>> ...
>>
>> There should be another error message after the above.
>
> Nope. Here's the full set of lines in the postgres logs when running the
> above line:
>
> 2019-06-01 17:13:03.261 EDT [14909] LOG: database system was shut down at 2019-05-22 01:55:13 EDT
> ...
> 2019-06-01 17:13:03.263 EDT [14909] FATAL: syntax error in history file: f2W
> 2019-06-01 17:13:03.263 EDT [14909] HINT: Expected a numeric timeline ID.
> ...
> 2019-06-01 17:13:03.270 EDT [14908] DEBUG: proc_exit(-1): 0 callbacks to make
> ^C
> -bash-4.2$
>
> What's interesting is that f2W isn't a string you'd expect to be printed
> as part of the code logic (I mean, what is f2W?).
>
> postgresql10-server-10.8-1PGDG.rhel7.x86_64
> postgresql10-libs-10.8-1PGDG.rhel7.x86_64
> postgresql10-contrib-10.8-1PGDG.rhel7.x86_64
> postgresql10-10.8-1PGDG.rhel7.x86_64
>
> Without the recovery.conf file it actually started up. Going to tr
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 5:51 PM Adrian Klaver wrote:
> On 6/1/19 2:31 PM, Tom K wrote:
> > Spoke too soon. There's no tables when it's started without the
> > recovery.conf file.
>
> Were there any errors in the start up?

Nothing I would discern as a startup error:

[root@psql03 patroni]# sudo su - postgres
Last login: Sat Jun  1 17:30:31 EDT 2019 on pts/1
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.118 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
2019-06-01 17:41:27.477 EDT [15209] DEBUG: postgres: PostmasterMain: initial environment dump:
2019-06-01 17:41:27.477 EDT [15209] DEBUG: -----------------------------------------
2019-06-01 17:41:27.477 EDT [15209] DEBUG: XDG_SESSION_ID=219
2019-06-01 17:41:27.477 EDT [15209] DEBUG: HOSTNAME=psql03.nix.mds.xyz
2019-06-01 17:41:27.477 EDT [15209] DEBUG: SHELL=/bin/bash
2019-06-01 17:41:27.477 EDT [15209] DEBUG: TERM=xterm
2019-06-01 17:41:27.477 EDT [15209] DEBUG: HISTSIZE=1000
2019-06-01 17:41:27.477 EDT [15209] DEBUG: USER=postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG: LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-06-01 17:41:27.477 EDT [15209] DEBUG: MAIL=/var/spool/mail/postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG: PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-06-01 17:41:27.477 EDT [15209] DEBUG: PWD=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG: LANG=en_US.UTF-8
2019-06-01 17:41:27.477 EDT [15209] DEBUG: HISTCONTROL=ignoredups
2019-06-01 17:41:27.477 EDT [15209] DEBUG: SHLVL=1
2019-06-01 17:41:27.477 EDT [15209] DEBUG: HOME=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG: LOGNAME=postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG: PGDATA=/var/lib/pgsql/10/data
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-06-01 17:41:27.478 EDT [15209] DEBUG: _=/usr/pgsql-10/bin/postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG: PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-06-01 17:41:27.478 EDT [15209] DEBUG: PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_COLLATE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_CTYPE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_MESSAGES=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_MONETARY=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_NUMERIC=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_TIME=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG: -----------------------------------------
2019-06-01 17:41:27.494 EDT [15209] DEBUG: registering background worker "logical replication launcher"
2019-06-01 17:41:27.495 EDT [15209] LOG: listening on IPv4 address "192.168.0.118", port 5432
2019-06-01 17:41:27.498 EDT [15209] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-06-01 17:41:27.499 EDT [15209] DEBUG: invoking IpcMemoryCreate(size=148545536)
2019-06-01 17:41:27.499 EDT [15209] DEBUG: mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-01 17:41:27.514 EDT [15209] DEBUG: SlruScanDirectory invoking callback on pg_notify/
2019-0
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 6:05 PM Tom K wrote:
> On Sat, Jun 1, 2019 at 5:51 PM Adrian Klaver wrote:
>> On 6/1/19 2:31 PM, Tom K wrote:
>> > Spoke too soon. There's no tables when it's started without the
>> > recovery.conf file.
>>
>> Were there any errors in the start up?
>
> Nothing I would discern as a startup error:
>
> [root@psql03 patroni]# sudo su - postgres
> Last login: Sat Jun  1 17:30:31 EDT 2019 on pts/1
> -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.118 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG: postgres: PostmasterMain: initial environment dump:
> ...
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_TIME=C
> 2019-06-01 17:41:27.494 EDT [15209] DEBUG: registering background worker "logical replication launcher"
> 2019-06-01 17:41:27.495 EDT [15209] LOG: listening on IPv4 address "192.168.0.118", port 5432
> 2019-06-
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 6:39 PM Adrian Klaver wrote:
> On 6/1/19 3:14 PM, Tom K wrote:
> > ** Correction. There is postgres, template1 and template2 but none of
> > the other databases we had.
>
> Just noticed, is that really template2 or is it actually template0?

Apologies. You're right, that was a typo:

-bash-4.2$ psql -h psql03 -U postgres
Password for user postgres:
psql (10.8)
Type "help" for help.

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(3 rows)

postgres=#

> --
> Adrian Klaver
> adrian.kla...@aklaver.com
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 6:36 PM Adrian Klaver wrote:
> On 6/1/19 3:14 PM, Tom K wrote:
> > ** Correction. There is postgres, template1 and template2 but none of
> > the other databases we had.
>
> In a psql session do:
>
> select oid, datname from pg_database ;
>
> Then go to /data/patroni and drill down to the base directory. In that
> directory there should be sub-directories with numbers equal to the oid
> values from the query. If not then something is wrong. That could be:
>
> 1) Wrong data directory.
>
> 2) Copied over incorrect data directory.
>
> Basically if the only sub-directories are for the system databases then
> the user databases do not exist. You might want to look at some of the
> other clusters to see what is there.

-bash-4.2$ pwd
/var/lib/pgsql
-bash-4.2$ psql -h psql03 -U postgres
Password for user postgres:
psql (10.8)
Type "help" for help.

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(3 rows)

postgres=# select oid, datname from pg_database ;
  oid  |  datname
-------+-----------
 13806 | postgres
     1 | template1
 13805 | template0
(3 rows)

postgres=#
postgres=# cd /data/patroni
postgres-# ls -altri
postgres-# pwd
postgres-# pwd
postgres-# ls -altrid
postgres-# ;
ERROR:  syntax error at or near "cd"
LINE 1: cd /data/patroni
        ^
postgres=# \q
-bash-4.2$ ls -altri
total 652
134299846 drwxr-xr-x. 39 root     root       4096 Oct 24  2018 ..
 67178653 -rw-r--r--.  1 postgres postgres     30 Oct 24  2018 .pgsql_profile
134734937 drwx------.  4 postgres postgres     31 May  8 06:25 10
 67178637 drwx------.  3 postgres postgres   4096 May  8 06:25 .
 68994450 -rw-r--r--.  1 postgres postgres 204336 May 26 22:37 s1.log
 67677594 -rwx------.  1 postgres postgres    266 May 26 23:06 .bash_profile
 68994451 -rw-r--r--.  1 postgres postgres 433920 May 26 23:44 s2.log
 68391913 -rw-------.  1 postgres postgres   4716 Jun  1 17:34 .bash_history
 67178638 -rw-------.  1 postgres postgres    325 Jun  1 18:54 .psql_history
-bash-4.2$ cd /data/patroni/
-bash-4.2$ ls -altri
total 144
 69085037 drwxr-xr-x.  3 root     root        20 Oct 23  2018 ..
135316997 -rw-------.  1 postgres postgres   206 Oct 29  2018 backup_label.old
201708781 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_commit_ts
  1502746 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_dynshmem
 68994449 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_twophase
  1502749 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_snapshots
201708785 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_serial
  1502747 drwx------.  4 postgres postgres    34 Oct 29  2018 pg_multixact
 67677559 drwx------.  5 postgres postgres    38 Oct 29  2018 base
  1502751 drwx------.  2 postgres postgres    17 Oct 29  2018 pg_xact
  1502750 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_tblspc
134668882 -rw-------.  1 postgres postgres    88 Oct 29  2018 postgresql.auto.conf
134500181 -rw-------.  1 postgres postgres     3 Oct 29  2018 PG_VERSION
134500173 -rw-------.  1 postgres postgres  1636 Oct 29  2018 pg_ident.conf
134500171 -rw-------.  1 postgres postgres  4730 Oct 29  2018 pg_hba.conf
134668883 -rw-------.  1 postgres postgres 22783 Oct 29  2018 postgresql.base.conf
134500170 -rw-------.  1 postgres postgres   124 Oct 29  2018 patroni.dynamic.json
134668889 -rw-r--r--.  1 postgres postgres    90 Oct 29  2018 recovery.done
201708786 drwx------.  2 postgres postgres    17 Oct 29  2018 pg_subtrans
134500162 drwx------.  2 postgres postgres  4096 Nov  4  2018 log
134500182 drwx------.  3 postgres postgres  8192 Apr 28 06:06 pg_wal
  1502748 drwx------.  4 postgres postgres    42 May 22 01:55 pg_replslot
134668893 -rw-r--r--.  1 postgres postgres    90 May 26 22:34 t.txt
134668885 -rw-r--r--.  1 postgres postgres   575 May 26 22:34 postgresql.conf
134668886 -rw-r--r--.  1 postgres postgres   575 May 26 22:34 postgresql.conf.backup
134668884 -rw-------.  1 postgres postgres 22783 May 26 22:34 postgresql.base.conf.backup
134500172 -rw-------.  1 postgres postgres  4730 May 26 22:34 pg_hba.conf.backup
134500174 -rw-------.  1 postgres postgres  1636 May 26 22:34 pg_ident.conf.backup
 68994437 drwx------.  4 postgres postgres    65 Jun  1 17:34 pg_logical
134734900 -rw-------.  1 postgres postgres    38
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 7:12 PM Adrian Klaver wrote:
> On 6/1/19 3:56 PM, Tom K wrote:
> > postgres=# select oid, datname from pg_database ;
> >   oid  |  datname
> > -------+-----------
> >  13806 | postgres
> >      1 | template1
> >  13805 | template0
> > (3 rows)
>
> So there are only the system databases available.
>
> > -bash-4.2$ cd /data/patroni/
> > -bash-4.2$ ls -altri
> > total 144
> > ...
> > 67677559 drwx------. 5 postgres postgres 38 Oct 29 2018 base
>
> base/ is the directory you need to look in. I'm guessing it is only
> going to show the oid/ for the three db's above and pgsql_tmp/
>
> For more info on this see:
> https://www.postgresql.org/docs/10/storage-file-layout.html
>
> The below looks like the RH package installed data directory. Also looks
> like it either has never had initdb run against it or the files were removed.
>
> I thought you said you had copied in data directories from the other
> nodes, did I remember correctly?
>
> > -bash-4.2$ cd
> > -bash-4.2$ cd 10
> > -bash-4.2$ ls
> > backups  data
> > -bash-4.2$ pwd
> > /var/lib/pgsql/10
> > -bash-4.2$ cd data/
> > -bash-4.2$ ls -altri
> > total 0
> > 134734937 drwx------. 4 postgres postgres 31 May  8 06:25 ..
> >    245519 drwx------. 2 postgres postgres  6 May  8 06:25 .
> > -bash-4.2$ cd ..
> > -bash-4.2$ pwd
> > /var/lib/pgsql/10
> > -bash-4.2$ cd ..
> > -bash-4.2$

Yep, you remembered correctly. I copied the files as they were, out to a temporary folder under root for each node, but never dug into base/ etc. any further to check things.

So here's the state of things in the base/ folder of the backup of each node.

[ PSQL03 ]
[root@psql03 base]# ls -altri
total 40
    42424 drwx------.  2 postgres postgres 8192 Oct 29  2018 1
 67714749 drwx------.  2 postgres postgres 8192 Oct 29  2018 13805
202037206 drwx------.  5 postgres postgres   38 Oct 29  2018 .
134312175 drwx------.  2 postgres postgres 8192 May 22 01:55 13806
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 base]#

[ PSQL02 ]
[root@psql02 base]# ls -altri
total 412
201426668 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
   743426 drwx------.  2 postgres postgres  8192 Mar 24 03:47 13805
135326327 drwx------.  2 postgres postgres 16384 Mar 24 20:15 40970
   451699 drwx------.  2 postgres postgres 40960 Mar 25 19:47 16395
  1441696 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131137
 68396137 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131138
135671065 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131139
204353100 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131140
135326320 drwx------. 17 postgres postgres  4096 Apr 14 10:08 .
 68574415 drwx------.  2 postgres postgres 12288 Apr 28 06:06 131142
   288896 drwx------.  2 postgres postgres 16384 Apr 28 06:06 131141
203015232 drwx------.  2 postgres postgres  8192 Apr 28 06:06 131136
135326328 drwx------.  2 postgres postgres 40960 May  5 22:09 24586
 67282461 drwx------.  2 postgres postgres  8192 May  5 22:09 13806
 67640961 drwx------.  2 postgres postgres 20480 May  5 22:09 131134
203500274 drwx------.  2 postgres postgres 16384 May  5 22:09 155710
134438257 drwxr-xr-x. 20 root     root      4096 May 22 01:44 ..
[root@psql02 base]# pwd
/root/postgres-patroni-backup/base
[root@psql02 base]#

[ PSQL01 ]
[root@psql01 base]# ls -altri
total 148
134704615 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
201547700 drwx------.  2 postgres postgres  8192 Oct 29  2018 13805
   160398 drwx------.  2 postgres postgres  8192 Feb 24 23:53 13806
 67482137 drwx------.  7 postgres postgres    62 Feb 24 23:54 .
135909671 drwx------.  2 postgres postgres 24576 Feb 24 23:54 24586
    13555 drwx------.  2 postgres postgres 24576 Feb 24 23:54 16395
 67178716 drwxr-xr-x. 20 root     root      4096 May 22 01:53 ..
[root@psql01 base]# pwd
/root/postgresql-patroni-etcd/base
[root@psql01 base]#

Looks like this crash was far more catastrophic than I thought. By the looks of things, I'm thinking psql02 would be my best bet.

> --
> Adrian Klaver
> adrian.kla...@aklaver.com
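A sketch of mapping those base/ subdirectory numbers back to database names once a server is up; contrib's oid2name does the lookup (flags as in PostgreSQL 10):

$ /usr/pgsql-10/bin/oid2name -H psql02 -U postgres
# Lists every database with its OID; base/<oid>/ holds that database's files,
# so non-system numbers like 16395 and 131134 above would be the user databases.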
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver wrote:
> On 6/1/19 4:22 PM, Tom K wrote:
> > Looks like this crash was far more catastrophic than I thought. By the
> > looks of things, I'm thinking psql02 would be my best bet.
>
> The more I look at it the more I think the replication was not doing
> what you thought it was doing. That psql02 was the primary and that
> psql01 and psql03 were out of sync and/or defunct standbys.

Now that I look at the files, that's the conclusion I was coming to as well. Sample config:

[root@psql02 base]# cat /etc/patroni.yml
scope: postgres
namespace: /db/
name: postgresql1

restapi:
  listen: 192.168.0.124:8008
  connect_address: 192.168.0.124:8008

etcd:
  host: 192.168.0.124:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true

  initdb:
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - host replication replicator 127.0.0.1/32 md5
  - host replication replicator 192.168.0.108/0 md5
  - host replication replicator 192.168.0.124/0 md5
  - host replication replicator 192.168.0.118/0 md5
  - host all all 0.0.0.0/0 md5

  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 192.168.0.124:5432
  bin_dir: /usr/pgsql-10/bin
  connect_address: 192.168.0.124:5432
  data_dir: /data/patroni
  pgpass: /tmp/pgpass
  unix_socket_directories: /data/patroni
  authentication:
    replication:
      username: replicator
      password: rep-pass
    superuser:
      username: postgres
      password:
  parameters:
    unix_socket_directories: '.'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
[root@psql02 base]#

Or perhaps when the system crashed, the filesystem check simply moved the folders out due to corruption.

> --
> Adrian Klaver
> adrian.kla...@aklaver.com
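Once a node is healthy again, Patroni's own CLI is the quickest way to see who it thinks the primary is; a sketch (assuming patronictl is installed alongside Patroni):

$ patronictl -c /etc/patroni.yml list
# Prints the members with their role (Leader/Replica), state and replication lag;
# run periodically, this would have flagged defunct standbys well before a crash.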
Re: psql: FATAL: the database system is starting up
So the best bet will be trying to get through this error then:

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

On Sat, Jun 1, 2019 at 8:21 PM Tom K wrote:
> On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver wrote:
>> On 6/1/19 4:22 PM, Tom K wrote:
>> > Looks like this crash was far more catastrophic than I thought. By the
>> > looks of things, I'm thinking psql02 would be my best bet.
>>
>> The more I look at it the more I think the replication was not doing
>> what you thought it was doing. That psql02 was the primary and that
>> psql01 and psql03 were out of sync and/or defunct standbys.
>
> Now that I look at the files, that's the conclusion I was coming to as
> well. Sample config:
>
> [root@psql02 base]# cat /etc/patroni.yml
> scope: postgres
> namespace: /db/
> name: postgresql1
> ...
> [root@psql02 base]#
>
> Or perhaps when the system crashed, the filesystem check simply moved the
> folders out due to corruption.
>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 8:25 PM Tom K wrote:
> So the best bet will be trying to get through this error then:
>
> [ PSQL02 ]
> PANIC: replication checkpoint has wrong magic 0 instead of 307747550
>
> On Sat, Jun 1, 2019 at 8:21 PM Tom K wrote:
>> On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver wrote:
>>> The more I look at it the more I think the replication was not doing
>>> what you thought it was doing. That psql02 was the primary and that
>>> psql01 and psql03 were out of sync and/or defunct standbys.
>>
>> Now that I look at the files, that's the conclusion I was coming to as
>> well. Sample config:
>>
>> [root@psql02 base]# cat /etc/patroni.yml
>> ...
>> [root@psql02 base]#
>>
>> Or perhaps when the system crashed, the filesystem check simply moved
>> the folders out due to corruption.

Trying what we did above but on the second node:

2019-06-01 20:28:45.305 EDT [21745] LOG: database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:28:46.142 EDT [21745] DEBUG: primary_slot_name = 'postgresql1'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: standby_mode = 'on'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: recovery_target_timeline = latest
2019-06-01 20:28:46.142 EDT [21745] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 20:28:46.142 EDT [21745] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 20:28:46.142 EDT [21745] LOG: entering standby mode
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid primary checkpoint record
2019-06-01 20:28:46.142 EDT [21745] LOG: using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:28:46.142 EDT [21745] DEBUG: redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest MultiXactId: 1, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:28:46.143 EDT [21745] DEBUG: transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID
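With both checkpoint records unreadable, the usual last resort is pg_resetwal (PostgreSQL 10's name for the old pg_resetxlog), which discards WAL and can lose recent commits, so it belongs on a spare copy only, and after a dry run; a sketch:

# Run as postgres; -n only prints the control values and what would be reset.
$ /usr/pgsql-10/bin/pg_resetwal -n /data/patroni
# Drop -n only once everything else has failed, and only against a copy.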
Re: psql: FATAL: the database system is starting up
ect as any PostgreSQL user, including
# the database superuser. If you do not trust all your local users,
# use another authentication method.

# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust
host    replication     replicator      127.0.0.1/32            md5
host    replication     replicator      192.168.0.108/0         md5
host    replication     replicator      192.168.0.124/0         md5
host    replication     replicator      192.168.0.118/0         md5
host    all             all             0.0.0.0/0               md5
-bash-4.2$ pwd
/data/patroni
-bash-4.2$
-bash-4.2$ cat postgresql.base.conf | grep -v "#" | sed -e "/^$/d"
log_timezone = 'Canada/Eastern'
datestyle = 'iso, mdy'
timezone = 'Canada/Eastern'
default_text_search_config = 'pg_catalog.english'
-bash-4.2$

On Sat, Jun 1, 2019 at 8:36 PM Adrian Klaver wrote:
> On 6/1/19 5:21 PM, Tom K wrote:
> > On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver wrote:
> > > The more I look at it the more I think the replication was not doing
> > > what you thought it was doing. That psql02 was the primary and that
> > > psql01 and psql03 were out of sync and/or defunct standbys.
> >
> > Now that I look at the files, that's the conclusion I was coming to as
> > well. Sample config:
>
> The below would be for someone that uses and understands Patroni. That
> would not be me :)
>
> > [root@psql02 base]# cat /etc/patroni.yml
> > scope: postgres
> > namespace: /db/
> > name: postgresql1
> >
> > restapi:
> >   listen: 192.168.0.124:8008
> >   connect_address: 192.168.0.124:8008
>
> > Or perhaps when the system crashed, the filesystem check simply moved
> > the folders out due to corruption.
>
> That would leave the cluster in an inconsistent state and you would not
> have been able to start the one you got going.
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
Re: psql: FATAL: the database system is starting up
On Sat, Jun 1, 2019 at 8:53 PM Adrian Klaver wrote:

> On 6/1/19 5:32 PM, Tom K wrote:
> >
> > Trying what we did above but on the second node:
>
> Was this node the primary?
>
> To me the below looks like there are replication slots set up that are
> failing. Not sure how to deal with this at the moment. You might try
> single-user mode:
>
> https://www.postgresql.org/docs/10/app-postgres.html
> Single-User Mode
> ...
>
> and see if that at least gets the server started. This is a highly
> restricted mode, so do not expect much usability.

These servers did crash before; however, I didn't notice anything
particularly wrong at the time. Perhaps a better check was in order.

Here is the result of attempting to start the DB in single-user mode:

-bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=off --listen_addresses=192.168.0.124 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=20 -d 5 2>&1
2019-06-01 23:04:20.876 EDT [31831] DEBUG: invoking IpcMemoryCreate(size=144687104)
2019-06-01 23:04:20.877 EDT [31831] DEBUG: mmap(144703488) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-01 23:04:20.890 EDT [31831] DEBUG: SlruScanDirectory invoking callback on pg_notify/
2019-06-01 23:04:20.890 EDT [31831] DEBUG: removing file "pg_notify/"
2019-06-01 23:04:20.890 EDT [31831] DEBUG: dynamic shared memory system will support 128 segments
2019-06-01 23:04:20.891 EDT [31831] DEBUG: created dynamic shared memory control segment 846930886 (3088 bytes)
2019-06-01 23:04:20.891 EDT [31831] DEBUG: InitPostgres
2019-06-01 23:04:20.891 EDT [31831] DEBUG: my backend ID is 1
2019-06-01 23:04:20.891 EDT [31831] LOG: database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 23:04:21.239 EDT [31831] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 23:04:21.239 EDT [31831] LOG: invalid primary checkpoint record
2019-06-01 23:04:21.239 EDT [31831] LOG: using previous checkpoint record at 0/4C34EDA8
2019-06-01 23:04:21.239 EDT [31831] DEBUG: redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 23:04:21.239 EDT [31831] DEBUG: next transaction ID: 0:1409831; next OID: 237578
2019-06-01 23:04:21.239 EDT [31831] DEBUG: next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 23:04:21.239 EDT [31831] DEBUG: oldest unfrozen transaction ID: 549, in database 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG: oldest MultiXactId: 1, in database 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG: commit timestamp Xid oldest/newest: 0/0
2019-06-01 23:04:21.239 EDT [31831] DEBUG: transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG: starting up replication slots
2019-06-01 23:04:21.239 EDT [31831] DEBUG: starting up replication origin progress state
2019-06-01 23:04:21.239 EDT [31831] PANIC: replication checkpoint has wrong magic 0 instead of 307747550
Aborted
-bash-4.2$

-bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni --config-file=/data/patroni/postgresql.conf
2019-06-01 23:06:28.679 EDT [31969] LOG: database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 23:06:28.953 EDT [31969] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 23:06:28.953 EDT [31969] LOG: invalid primary checkpoint record
2019-06-01 23:06:28.953 EDT [31969] LOG: using previous checkpoint record at 0/4C34EDA8
2019-06-01 23:06:28.953 EDT [31969] PANIC: replication checkpoint has wrong magic 0 instead of 307747550
Aborted
-bash-4.2$

> > Removing the recovery.conf and restarting yields the following messages:
> >
> > 2019-06-01 20:31:12.231 EDT [21910] LOG: database system was
> > interrupted; last known up at 2019-04-28 06:06:24 EDT
> > 2019-06-01 20:31:12.505 EDT [21910] LOG: invalid record length at
> > 0/4C35CDF8: wanted 24, got 0
> > 2019-06-01 20:31:12.505 EDT [21910] LOG: invalid primary checkpoint record
> > 2019-06-01 20:31:12.505 EDT [21910] LOG: using previous checkpoint
> > record at 0/4C34EDA8
> > 2019-06-01 20:31:12.505 EDT [21910] DEBUG: redo record is at
> > 0/4C34ED70; shutdown FALSE
> > 2019-06-01 20:31:12.505 EDT [21910] DEBUG: next transaction ID:
> > 0:1409831; next OID: 237578
> > 2019-06-01 20:31:12.505 EDT [21910] DEBUG: next MultiXactId: 48; next
> > MultiXactOffset: 174
> > 2019-06-01 20:31:12.505 EDT [21910] DEBUG: oldest unfr
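For anyone retracing these steps: the checkpoint locations the startup log complains about can be cross-checked without starting the server at all. A minimal read-only check, assuming the same PostgreSQL 10 paths used in this thread:

# Read-only dump of the cluster's control file; safe to run while the
# server is down. The "Latest checkpoint location" and "Prior checkpoint
# location" fields should line up with the 0/4C35CDF8 and 0/4C34EDA8
# locations reported in the log above.
/usr/pgsql-10/bin/pg_controldata /data/patroni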
Re: psql: FATAL: the database system is starting up
On Sun, Jun 2, 2019 at 11:47 AM Adrian Klaver wrote:

> On 6/1/19 8:07 PM, Tom K wrote:
> >
> > > https://www.postgresql.org/docs/10/app-postgres.html
> > > Single-User Mode
> > > ...
> > >
> > > and see if that at least gets the server started. This is a highly
> > > restricted mode, so do not expect much usability.
> >
> > These servers did crash before; however, I didn't notice anything
> > particularly wrong at the time. Perhaps a better check was in order.
> >
> > Here is the result of attempting to start the DB in single-user mode.
> >
> > -bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni
> > --config-file=/data/patroni/postgresql.conf
> > 2019-06-01 23:06:28.679 EDT [31969] LOG: database system was
> > interrupted; last known up at 2019-04-28 06:06:24 EDT
> > 2019-06-01 23:06:28.953 EDT [31969] LOG: invalid record length at
> > 0/4C35CDF8: wanted 24, got 0
> > 2019-06-01 23:06:28.953 EDT [31969] LOG: invalid primary checkpoint record
> > 2019-06-01 23:06:28.953 EDT [31969] LOG: using previous checkpoint
> > record at 0/4C34EDA8
> > 2019-06-01 23:06:28.953 EDT [31969] PANIC: replication checkpoint has
> > wrong magic 0 instead of 307747550
> > Aborted
> > -bash-4.2$
>
> That comes from origin.c in the logical replication section of the code.
>
> Did you have logical replication set up?

Nope. wal_level was set to replica, not logical. Unless you mean:

> What was the role of this cluster in the original setup?

The cluster was the backend database for a number of applications. The
aim was to point applications at a single large cluster instead of a
number of small instances of postgres running all over the LAB.

If I can get past the above error and get to the DBs and tables, I could
just dump them out and reinit the entire DB from the backup.

> --
> Adrian Klaver
> adrian.kla...@aklaver.com
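A note for anyone hitting the same PANIC: 307747550 is 0x1257DADE, the REPLICATION_STATE_MAGIC constant that origin.c writes as the first four bytes of pg_logical/replorigin_checkpoint. The server writes that file at every checkpoint regardless of wal_level, which would explain why it shows up here even though logical replication was never configured. A quick sketch of checking the on-disk header, assuming a little-endian machine such as these x86 VMs:

# Print the first 4 bytes of the replication-origin checkpoint file as a
# 32-bit hex word. A healthy file starts with 1257dade (decimal 307747550);
# the corrupted file in this thread evidently read back as 0.
od -A n -t x4 -N 4 /data/patroni/pg_logical/replorigin_checkpoint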
Re: psql: FATAL: the database system is starting up
Hey Adrian,

Fixed it. I saw the post from jebriggs, but that didn't work for me, so I
posted here. Anyway, here's how I resolved it:

When I ran strace on the postgres startup line, I got this:

open("pg_logical/replorigin_checkpoint", O_RDONLY) = 6
write(2, "2019-06-02 14:50:34.777 EDT [283"..., 106
2019-06-02 14:50:34.777 EDT [28362] PANIC: replication checkpoint has wrong magic 0 instead of 307747550
-bash-4.2$

Notice how it's reading *pg_logical/replorigin_checkpoint* even though
wal_level = 'replica'. It sure looks like it's doing logical replication.
Anyway, I checked the file and saw this:

-bash-4.2$ ls -altri pg_logical/
total 20
 67894871 -rw-------.  1 postgres postgres 16384 Oct 29  2018 replorigin_checkpoint
136946383 drwx------.  2 postgres postgres     6 Oct 29  2018 snapshots
204367784 drwx------.  2 postgres postgres     6 Oct 29  2018 mappings
 67894870 drwx------.  4 postgres postgres    65 Apr 28 06:06 .
135326272 drwx------. 21 postgres postgres  4096 Jun  2 14:50 ..
-bash-4.2$

Notice the date and the size: both are way off. Then I checked the same
file on the other nodes:

[root@psql03 pg_logical]# ls -altri
total 8
 68994432 drwx------.  2 postgres postgres    6 Oct 29  2018 snapshots
134984156 drwx------.  2 postgres postgres    6 Oct 29  2018 mappings
   566745 -rw-------.  1 postgres postgres    8 May 22 01:55 replorigin_checkpoint
   566731 drwx------.  4 postgres postgres   65 May 22 01:55 .
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 pg_logical]#

The other hosts' replorigin_checkpoint files are much more recent and much
smaller. So I took the replorigin_checkpoint from psql03, copied it over
to the former primary node, psql02, and tried a restart.

Everything started up. Replication to the other postgres nodes started
without issues. I verified the base/ folder on all nodes to ensure
replication is working. Every table is back to the way it was before this
entire outage.

Adrian, thank you very much for all the help here. It was much
appreciated. I've written this up and posted it at the following two links
so others can benefit and I don't forget:

https://microdevsys.com/wp/panic-replication-checkpoint-has-wrong-magic-0-instead-of-307747550/

And of course a handy way to do backups, which I should have implemented
before this happened:

https://microdevsys.com/wp/postgresql-pull-backup/

Thx,
TK

On Sun, Jun 2, 2019 at 4:48 PM Adrian Klaver wrote:

> On 6/2/19 11:14 AM, Tom K wrote:
> >
> > Nope. wal_level was set to replica, not logical. Unless you mean:
> >
> > > What was the role of this cluster in the original setup?
> >
> > The cluster was the backend database for a number of applications. The
> > aim was to point applications at a single large cluster instead of a
> > number of small instances of postgres running all over the LAB.
> >
> > If I can get past the above error and get to the DBs and tables, I
> > could just dump them out and reinit the entire DB from the backup.
>
> The only thing I could find is the below:
>
> http://www.jebriggs.com/blog/2018/05/postgresql-and-panic-replication-checkpoint-has-wrong-magic-error/
>
> If that does not work I would suggest sending a new post (new subject) to
> the mailing list based on:
>
> replication checkpoint has wrong magic 0 instead of 307747550
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
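For the record, the fix described above boils down to a handful of commands. A sketch, assuming the data directory is /data/patroni on both nodes, the failed node's PostgreSQL is stopped, and Patroni runs as a systemd service named patroni (the service name is an assumption; adjust to your setup):

# 1. Preserve the corrupted file before touching anything.
cp -a /data/patroni/pg_logical/replorigin_checkpoint \
      /data/patroni/pg_logical/replorigin_checkpoint.corrupt

# 2. Pull the healthy 8-byte checkpoint file from psql03.
scp psql03:/data/patroni/pg_logical/replorigin_checkpoint \
    /data/patroni/pg_logical/replorigin_checkpoint

# 3. Ownership and permissions must match the rest of the data directory.
chown postgres:postgres /data/patroni/pg_logical/replorigin_checkpoint
chmod 600 /data/patroni/pg_logical/replorigin_checkpoint

# 4. Restart the stack (assumed service name).
systemctl start patroni

An 8-byte replorigin_checkpoint is just the magic word plus a CRC32C of an empty state, that is, no replication-origin progress recorded at all, which is why copying it between nodes that never used logical replication is harmless.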