Re: psql: FATAL: the database system is starting up

2019-05-28 Thread Tom K
On Tue, May 28, 2019 at 9:53 AM Adrian Klaver 
wrote:

> On 5/27/19 9:59 PM, Tom K wrote:
> > Hey Guys,
> >
> >
> > I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
> > RHEL 7.6 VM's.  Every now and then the underlying storage crashes taking
> > out the cluster.  On recovery, PostgreSQL tends to come up while other
> > databases just blow up.  That is until now.
>
> The above is one master and two standby servers connected via streaming
> replication, correct?


> Is there WAL archiving set up?
>

Correct.  Master election occurs through Patroni.  WAL level is set to:

wal_level = 'replica'

So no archiving.
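
For reference, turning archiving on would have meant something along these
lines in postgresql.conf (the archive destination below is only a
placeholder, not something we actually run):

archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'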


>
>
> >
> > After the most recent crash 2-3 weeks ago, the cluster is now running
> > into this message but I'm not able to make heads or tails out of why
> > it's throwing this:
>
> So you have not been able to run the cluster the past 2-3 weeks or is
> that  more recent?
>

Haven't been able to bring this PostgreSQL cluster up (run the cluster)
since 2-3 weeks ago.  Tried quite a few combinations of options to recover
this.  No luck.  Had storage failures earlier, even with corrupted OS
files, but this PostgreSQL cluster w/ Patroni was able to come up each time
without any recovery effort on my part.


>
> When you refer to history files below are you talking about WAL files or
> something else?
>
> Is this:
>
> "recovery command file "recovery.conf" specified neither
> primary_conninfo nor restore_command"
>
> true?
>

True. recovery.conf is controlled by Patroni.  Contents of this file
remained the same for all the cluster nodes with the exception of the
primary_slot_name:

[root@psql01 postgresql-patroni-etcd]# cat recovery.conf
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql01 postgresql-patroni-etcd]#

[root@psql02 postgres-backup]# cat recovery.conf
primary_slot_name = 'postgresql1'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql02 postgres-backup]#

[root@psql03 postgresql-patroni-backup]# cat recovery.conf
primary_slot_name = 'postgresql2'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql03 postgresql-patroni-backup]#

I've made a copy of the root postgres directory over to another location so
when troubleshooting, I can always revert to the first state the cluster
was in when it failed.
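
The copy itself was nothing fancy -- roughly the following on each node
(nothing was running at that point anyway; the destination path is just the
one I picked for that node):

cp -a /data/patroni /root/postgresql-patroni-etcd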

Thx,
TK


>
>
>
> >
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
> > 2019-05-23 08:26:34.561 EDT [10101] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:26:34.565 EDT [10101] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:26:34.591 EDT [10101] LOG:  redirecting log output to logging collector process
> > 2019-05-23 08:26:34.591 EDT [10101] HINT:  Future log output will appear in directory "log".
> > ^C
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > 2019-05-23 08:26:45.707 EDT [10105] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:26:45.715 EDT [10105] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:26:45.740 EDT [10105] LOG:  redirecting log output to logging collector process
> > 2019-05-23 08:26:45.740 EDT [10105] HINT:  Future log output will appear in directory "log".
> > ^C
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.1 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > 2019-05-23 08:29:36.014 EDT [10188] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:29:36.017 EDT [10188] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:29:36.040 EDT [10188] LOG:  redirecting log output to logging collector process
> > 2019-05-23 08:29:36.040 EDT [10188] HINT:  Future log output

Re: psql: FATAL: the database system is starting up

2019-05-31 Thread Tom K
On Wed, May 29, 2019 at 10:28 AM Adrian Klaver 
wrote:

> On 5/28/19 6:59 PM, Tom K wrote:
> >
> >
> > On Tue, May 28, 2019 at 9:53 AM Adrian Klaver  wrote:
> >
>
> >
> > Correct.  Master election occurs through Patroni.  WAL level is set to:
> >
> > wal_level = 'replica'
> >
> > So no archiving.
> >
> >
> >
> >  >
> >  > After the most recent crash 2-3 weeks ago, the cluster is now
> > running
> >  > into this message but I'm not able to make heads or tails out of
> why
> >  > it's throwing this:
> >
> > So you have not been able to run the cluster the past 2-3 weeks or is
> > that  more recent?
> >
> >
> > Haven't been able to bring this PostgreSQL cluster up (run the cluster)
> > since 2-3 weeks ago.  Tried quite a few combinations of options to
> > recover this.  No luck.  Had storage failures earlier, even with
> > corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to
> > come up each time without any recovery effort on my part.
> >
> >
> > When you refer to history files below are you talking about WAL
> > files or
> > something else?
> >
> > Is this:
> >
> > "recovery command file "recovery.conf" specified neither
> > primary_conninfo nor restore_command"
> >
> > true?
> >
> >
> > True. recovery.conf is controlled by Patroni.  Contents of this file
> > remained the same for all the cluster nodes with the exception of the
> > primary_slot_name:
> >
> > [root@psql01 postgresql-patroni-etcd]# cat recovery.conf
> > primary_slot_name = 'postgresql0'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql01 postgresql-patroni-etcd]#
> >
> > [root@psql02 postgres-backup]# cat recovery.conf
> > primary_slot_name = 'postgresql1'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql02 postgres-backup]#
> >
> > [root@psql03 postgresql-patroni-backup]# cat recovery.conf
> > primary_slot_name = 'postgresql2'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql03 postgresql-patroni-backup]#
> >
> > I've made a copy of the root postgres directory over to another location
> > so when troubleshooting, I can always revert to the first state the
> > cluster was in when it failed.
>
> I have no experience with Patroni so I will be of no help there. You
> might get more useful information from:
>
> https://github.com/zalando/patroni
> Community
>
> There are two places to connect with the Patroni community: on github,
> via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
> you're using Patroni, or just interested, please join us.
>

Will post there as well.  Thank you.  My thinking was to post here first
since I suspect the Patroni community will simply refer me back here given
that the PostgreSQL errors are originating directly from PostgreSQL.


>
> That being said, can you start the copied Postgres instance without
> using the Patroni instrumentation?
>

Yes, that is something I have been trying to do actually.  But I hit a dead
end with the three errors above.

So what I did is to copy a single node's backed-up data files to
*/data/patroni* on the same node (this is the psql data directory as
defined through Patroni) and then ran this (psql03 = 192.168.0.118):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
--listen_addresses=192.168.0.118 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100 --hot_standby=on
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

This resulted in one of the 3 messages above.  Hence the post here.  If I
can start a single instance, I should be fine since I could then 1)
replicate over to the other two or 2) simply take a dump, reinitialize all
the databases then restore the dump.
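
Option 2 would look roughly like this once any one node accepts connections
(host and file paths are only examples):

pg_dumpall -h 192.168.0.118 -U postgres -f /root/all_databases.sql
# re-initialize a fresh data directory (initdb, or let Patroni bootstrap it),
# start it, then:
psql -h 192.168.0.118 -U postgres -f /root/all_databases.sql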

Using the above procedure I get one of three error messages when using the
data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 00010008

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL: syntax error in history file: f2W

And I can't start any one of them.
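
Before digging further I also plan to capture the control file state from
each node's copy, e.g.:

/usr/pgsql-10/bin/pg_controldata /data/patroni

which should at least show the last checkpoint location and the cluster
state recorded in each data directory.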


>
> >
> > Thx,
> > TK
> >
>
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 9:55 AM Adrian Klaver 
wrote:

> On 5/31/19 7:53 PM, Tom K wrote:
> >
>
> > There are two places to connect with the Patroni community: on
> github,
> > via Issues and PRs, and on channel #patroni in the PostgreSQL Slack.
> If
> > you're using Patroni, or just interested, please join us.
> >
> >
> > Will post there as well.  Thank you.  My thinking was to post here first
> > since I suspect the Patroni community will simply refer me back here
> > given that the PostgreSQL errors are originating directly from
> PostgreSQL.
> >
> >
> > That being said, can you start the copied Postgres instance without
> > using the Patroni instrumentation?
> >
> >
> > Yes, that is something I have been trying to do actually.  But I hit a
> > dead end with the three errors above.
> >
> > So what I did is to copy a single node's backed up copy of the data
> > files to */data/patroni* of the same node ( this is the psql data
> > directory as defined through patroni ) of the same node then ran this (
> > psql03 = 192.168.0.118 ):
> >
> > # sudo su - postgres
> > $ /usr/pgsql-10/bin/postgres -D /data/patroni
> > --config-file=/data/patroni/postgresql.conf
> > --listen_addresses=192.168.0.118 --max_worker_processes=8
> > --max_locks_per_transaction=64 --wal_level=replica
> > --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
> > --max_replication_slots=10 --max_connections=100 --hot_standby=on
> > --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
>
> Why all the options?
> That should be covered in postgresql.conf, no?
>
> >
> > This resulted in one of the 3 messages above.  Hence the post here.  If
> > I can start a single instance, I should be fine since I could then 1)
> > replicate over to the other two or 2) simply take a dump, reinitialize
> > all the databases then restore the dump.
> >
>
> What if you move the recovery.conf file out?


Will try.
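
The plan is to move it aside rather than delete it, something like:

mv /data/patroni/recovery.conf /data/patroni/recovery.conf.disabled

and then start postgres with the same command line as above.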


>
> The below looks like missing/corrupted/incorrect files. Hard to tell
> without knowing what Patroni did?


Storage disappeared from underneath these clusters.  The OS was of course
still in memory making futile attempts to write to disk, which would never
complete.

My best guess is that Patroni or Postgres was in the middle of some writes
across the clusters when the failure occurred.


>
> > Using the above procedure I get one of three error messages when using
> > the data files of each node:
> >
> > [ PSQL01 ]
> > postgres: postgres: startup process waiting for 00010008
> >
> > [ PSQL02 ]
> > PANIC: replication checkpoint has wrong magic 0 instead of 307747550
> >
> > [ PSQL03 ]
> > FATAL: syntax error in history file: f2W
> >
> > And I can't start any one of them.
> >
> >
> >
> >  >
> >  > Thx,
> >  > TK
> >  >
> >
> >
> >
> > --
> > Adrian Klaver
> > adrian.kla...@aklaver.com
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 3:32 PM Tom K  wrote:

>
>
> On Sat, Jun 1, 2019 at 9:55 AM Adrian Klaver 
> wrote:
>
>> On 5/31/19 7:53 PM, Tom K wrote:
>> >
>>
>> > There are two places to connect with the Patroni community: on
>> github,
>> > via Issues and PRs, and on channel #patroni in the PostgreSQL
>> Slack. If
>> > you're using Patroni, or just interested, please join us.
>> >
>> >
>> > Will post there as well.  Thank you.  My thinking was to post here
>> first
>> > since I suspect the Patroni community will simply refer me back here
>> > given that the PostgreSQL errors are originating directly from
>> PostgreSQL.
>> >
>> >
>> > That being said, can you start the copied Postgres instance without
>> > using the Patroni instrumentation?
>> >
>> >
>> > Yes, that is something I have been trying to do actually.  But I hit a
>> > dead end with the three errors above.
>> >
>> > So what I did is to copy a single node's backed up copy of the data
>> > files to */data/patroni* of the same node ( this is the psql data
>> > directory as defined through patroni ) of the same node then ran this (
>> > psql03 = 192.168.0.118 ):
>> >
>> > # sudo su - postgres
>> > $ /usr/pgsql-10/bin/postgres -D /data/patroni
>> > --config-file=/data/patroni/postgresql.conf
>> > --listen_addresses=192.168.0.118 --max_worker_processes=8
>> > --max_locks_per_transaction=64 --wal_level=replica
>> > --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
>> > --max_replication_slots=10 --max_connections=100 --hot_standby=on
>> > --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
>>
>> Why all the options?
>> That should be covered in postgresql.conf, no?
>>
>> >
>> > This resulted in one of the 3 messages above.  Hence the post here.  If
>> > I can start a single instance, I should be fine since I could then 1)
>> > replicate over to the other two or 2) simply take a dump, reinitialize
>> > all the databases then restore the dump.
>> >
>>
>> What if you move the recovery.conf file out?
>
>
> Will try.
>
>
>>
>> The below looks like missing/corrupted/incorrect files. Hard to tell
>> without knowing what Patroni did?
>
>
> Storage disappeared from underneath these clusters.  The OS was of course
> still in memory making futile attempts to write to disk, which would never
> complete.
>
> My best guess is that Patroni or Postgres was in the middle of some
> writes across the clusters when the failure occurred.
>

Of note are the characters f2W below.  I see nothing in the postgres source
code to indicate this is any recognizable postgres message.  A part of me
suspects that the postgres binaries got corrupted; I had a case like that
occur with glib-common, and a reinstall fixed it.  However, the postgres
binaries' checksum matches a standalone install perfectly, so that should
not be an issue.
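
Since the complaint is about a history file, the next thing I'll look at is
the timeline history files under pg_wal, roughly:

ls -l /data/patroni/pg_wal/*.history
cat /data/patroni/pg_wal/00000002.history

(the file name above is just an example; a healthy history file holds a
numeric timeline ID, an LSN and a reason per line, so garbage like f2W would
suggest the file contents themselves are damaged).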



>
>>
>> > Using the above procedure I get one of three error messages when using
>> > the data files of each node:
>> >
>> > [ PSQL01 ]
>> > postgres: postgres: startup process waiting for 00010008
>> >
>> > [ PSQL02 ]
>> > PANIC: replication checkpoint has wrong magic 0 instead of 307747550
>> >
>> > [ PSQL03 ]
>> > FATAL: syntax error in history file: f2W
>> >
>> > And I can't start any one of them.
>> >
>> >
>> >
>> >  >
>> >  > Thx,
>> >  > TK
>> >  >
>> >
>> >
>> >
>> > --
>> > Adrian Klaver
>> > adrian.kla...@aklaver.com
>> >
>>
>>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com
>>
>


Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 4:11 PM Adrian Klaver 
wrote:

> On 6/1/19 12:32 PM, Tom K wrote:
> >
> >
>
> >
> > What if you move the recovery.conf file out?
> >
> >
> > Will try.
> >
> >
> >
> > The below looks like missing/corrupted/incorrect files. Hard to tell
> > without knowing what Patroni did?
> >
> >
> > Storage disappeared from underneath these clusters.  The OS was of
> > course still in memory making futile attempts to write to disk, which
> > would never complete.
> >
> > My best guess is that Patroni or Postgres was in the middle of some
> > writes across the clusters when the failure occurred.
>
> So to be clear all three clusters where writing to the same storage
> medium and there was no WAL archiving to some other storage?


Yep, cheap lab hardware with no power redundancy (yet).


>
> >
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 4:52 PM Adrian Klaver 
wrote:

> On 6/1/19 12:42 PM, Tom K wrote:
> >
> >
>
> >
> > Of note are the characters f2W below.  I see nothing in the postgres
> > source code to indicate this is any recognizable postgres message.  A
> > part of me suspects that the postgres binaries got corrupted.   Had this
> > case occur with glib-common and a reinstall fixed it.  However the
> > postgres binaries csum matches a standalone install perfectly so that
> > should not be an issue.
>
> It comes from timeline.c:
>
> https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html
>
> pg_log_error("syntax error in history file: %s", fline);
>
> ...
>
> There should be another error message after the above.
>

Nope.  Here's the full set of lines in the postgres logs when running the
above line:

2019-06-01 17:13:03.261 EDT [14909] LOG:  database system was shut down at
2019-05-22 01:55:13 EDT
2019-06-01 17:13:03.261 EDT [14909] DEBUG:  primary_slot_name =
'postgresql2'
2019-06-01 17:13:03.261 EDT [14909] DEBUG:  standby_mode = 'on'
2019-06-01 17:13:03.261 EDT [14909] DEBUG:  recovery_target_timeline =
latest
2019-06-01 17:13:03.261 EDT [14909] WARNING:  recovery command file
"recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 17:13:03.261 EDT [14909] HINT:  The database server will
regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 17:13:03.263 EDT [14909] LOG:  entering standby mode
2019-06-01 17:13:03.263 EDT [14909] FATAL:  syntax error in history file:
f2W
2019-06-01 17:13:03.263 EDT [14909] HINT:  Expected a numeric timeline ID.
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  shmem_exit(1): 1
before_shmem_exit callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  shmem_exit(1): 4 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  proc_exit(1): 2 callbacks to
make
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  exit(1)
2019-06-01 17:13:03.264 EDT [14909] DEBUG:  shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG:  shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG:  proc_exit(-1): 0 callbacks to
make
2019-06-01 17:13:03.265 EDT [14907] DEBUG:  reaping dead processes
2019-06-01 17:13:03.265 EDT [14907] LOG:  startup process (PID 14909)
exited with exit code 1
2019-06-01 17:13:03.265 EDT [14907] LOG:  aborting startup due to startup
process failure
2019-06-01 17:13:03.265 EDT [14907] DEBUG:  shmem_exit(1): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG:  shmem_exit(1): 5 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG:  cleaning up dynamic shared
memory control segment with ID 1669075945
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  proc_exit(1): 2 callbacks to
make
2019-06-01 17:13:03.268 EDT [14907] LOG:  database system is shut down
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  exit(1)
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  proc_exit(-1): 0 callbacks to
make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  logger shutting down
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(0): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(0): 0 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  proc_exit(0): 0 callbacks to
make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  exit(0)
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  proc_exit(-1): 0 callbacks to
make
^C
-bash-4.2$

What's interesting is that f2W isn't a string you'd expect to be printed as
part of the code logic (I mean, what is f2W?).

postgresql10-server-10.8-1PGDG.rhel7.x86_64
postgresql10-libs-10.8-1PGDG.rhel7.x86_64
postgresql10-contrib-10.8-1PGDG.rhel7.x86_64
postgresql10-10.8-1PGDG.rhel7.x86_64

Without the recovery.conf file it actually started up.  Going to try to get
a dump then continue to try and recover in place.

2019-06-01 17:17:29.777 EDT [14936] DEBUG:  StartTransaction(1) name:
unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.777 EDT [14936] DEBUG:  CommitTransaction(1) name:
unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.778 EDT [14937] DEBUG:  received inquiry for database 0
2019-06-01 17:17:29.778 EDT [14937] DEBUG:  writing stats file
"pg_stat_tmp/global.stat"
2019-06-01 17:17:29.779 EDT [14937] DEBUG:  writing

Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 5:30 PM Tom K  wrote:

>
>
> On Sat, Jun 1, 2019 at 4:52 PM Adrian Klaver 
> wrote:
>
>> On 6/1/19 12:42 PM, Tom K wrote:
>> >
>> >
>>
>> >
>> > Of note are the characters f2W below.  I see nothing in the postgres
>> > source code to indicate this is any recognizable postgres message.  A
>> > part of me suspects that the postgres binaries got corrupted.   Had
>> this
>> > case occur with glib-common and a reinstall fixed it.  However the
>> > postgres binaries csum matches a standalone install perfectly so that
>> > should not be an issue.
>>
>> It comes from timeline.c:
>>
>> https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html
>>
>> pg_log_error("syntax error in history file: %s", fline);
>>
>> ...
>>
>> There should be another error message after the above.
>>
>
> Nope.  Here's the full set of lines in the postgres logs when running the
> above line:
>
> 2019-06-01 17:13:03.261 EDT [14909] LOG:  database system was shut down at
> 2019-05-22 01:55:13 EDT
> 2019-06-01 17:13:03.261 EDT [14909] DEBUG:  primary_slot_name =
> 'postgresql2'
> 2019-06-01 17:13:03.261 EDT [14909] DEBUG:  standby_mode = 'on'
> 2019-06-01 17:13:03.261 EDT [14909] DEBUG:  recovery_target_timeline =
> latest
> 2019-06-01 17:13:03.261 EDT [14909] WARNING:  recovery command file
> "recovery.conf" specified neither primary_conninfo nor restore_command
> 2019-06-01 17:13:03.261 EDT [14909] HINT:  The database server will
> regularly poll the pg_wal subdirectory to check for files placed there.
> 2019-06-01 17:13:03.263 EDT [14909] LOG:  entering standby mode
> 2019-06-01 17:13:03.263 EDT [14909] FATAL:  syntax error in history file:
> f2W
> 2019-06-01 17:13:03.263 EDT [14909] HINT:  Expected a numeric timeline ID.
> 2019-06-01 17:13:03.263 EDT [14909] DEBUG:  shmem_exit(1): 1
> before_shmem_exit callbacks to make
> 2019-06-01 17:13:03.263 EDT [14909] DEBUG:  shmem_exit(1): 4 on_shmem_exit
> callbacks to make
> 2019-06-01 17:13:03.263 EDT [14909] DEBUG:  proc_exit(1): 2 callbacks to
> make
> 2019-06-01 17:13:03.263 EDT [14909] DEBUG:  exit(1)
> 2019-06-01 17:13:03.264 EDT [14909] DEBUG:  shmem_exit(-1): 0
> before_shmem_exit callbacks to make
> 2019-06-01 17:13:03.264 EDT [14909] DEBUG:  shmem_exit(-1): 0
> on_shmem_exit callbacks to make
> 2019-06-01 17:13:03.264 EDT [14909] DEBUG:  proc_exit(-1): 0 callbacks to
> make
> 2019-06-01 17:13:03.265 EDT [14907] DEBUG:  reaping dead processes
> 2019-06-01 17:13:03.265 EDT [14907] LOG:  startup process (PID 14909)
> exited with exit code 1
> 2019-06-01 17:13:03.265 EDT [14907] LOG:  aborting startup due to startup
> process failure
> 2019-06-01 17:13:03.265 EDT [14907] DEBUG:  shmem_exit(1): 0
> before_shmem_exit callbacks to make
> 2019-06-01 17:13:03.266 EDT [14907] DEBUG:  shmem_exit(1): 5 on_shmem_exit
> callbacks to make
> 2019-06-01 17:13:03.266 EDT [14907] DEBUG:  cleaning up dynamic shared
> memory control segment with ID 1669075945
> 2019-06-01 17:13:03.268 EDT [14907] DEBUG:  proc_exit(1): 2 callbacks to
> make
> 2019-06-01 17:13:03.268 EDT [14907] LOG:  database system is shut down
> 2019-06-01 17:13:03.268 EDT [14907] DEBUG:  exit(1)
> 2019-06-01 17:13:03.268 EDT [14907] DEBUG:  shmem_exit(-1): 0
> before_shmem_exit callbacks to make
> 2019-06-01 17:13:03.268 EDT [14907] DEBUG:  shmem_exit(-1): 0
> on_shmem_exit callbacks to make
> 2019-06-01 17:13:03.268 EDT [14907] DEBUG:  proc_exit(-1): 0 callbacks to
> make
> 2019-06-01 17:13:03.270 EDT [14908] DEBUG:  logger shutting down
> 2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(0): 0
> before_shmem_exit callbacks to make
> 2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(0): 0 on_shmem_exit
> callbacks to make
> 2019-06-01 17:13:03.270 EDT [14908] DEBUG:  proc_exit(0): 0 callbacks to
> make
> 2019-06-01 17:13:03.270 EDT [14908] DEBUG:  exit(0)
> 2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(-1): 0
> before_shmem_exit callbacks to make
> 2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(-1): 0
> on_shmem_exit callbacks to make
> 2019-06-01 17:13:03.270 EDT [14908] DEBUG:  proc_exit(-1): 0 callbacks to
> make
> ^C
> -bash-4.2$
>
> What's interesting is that f2W isn't a string you'd expect to be printed
> as part of the code logic (I mean, what is f2W?).
>
> postgresql10-server-10.8-1PGDG.rhel7.x86_64
> postgresql10-libs-10.8-1PGDG.rhel7.x86_64
> postgresql10-contrib-10.8-1PGDG.rhel7.x86_64
> postgresql10-10.8-1PGDG.rhel7.x86_64
>
> Without the recovery.conf file it actually started up.  Going to tr

Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 5:51 PM Adrian Klaver 
wrote:

> On 6/1/19 2:31 PM, Tom K wrote:
> >
> >
>
> >
> > Spoke too soon.  There's no tables when it's started without the
> > recovery.conf file.
>
> Were there any errors in the start up?
>

Nothing I would discern as a startup error:

[root@psql03 patroni]# sudo su - postgres
Last login: Sat Jun  1 17:30:31 EDT 2019 on pts/1
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
--listen_addresses=192.168.0.118 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100 --hot_standby=on
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  postgres: PostmasterMain:
initial environment dump:
2019-06-01 17:41:27.477 EDT [15209] DEBUG:
 -
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  XDG_SESSION_ID=219
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  HOSTNAME=psql03.nix.mds.xyz
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  SHELL=/bin/bash
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  TERM=xterm
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  HISTSIZE=1000
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  USER=postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG:
 
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-06-01 17:41:27.477 EDT [15209] DEBUG:
 MAIL=/var/spool/mail/postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG:
 PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  PWD=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  LANG=en_US.UTF-8
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  HISTCONTROL=ignoredups
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  SHLVL=1
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  HOME=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  LOGNAME=postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
 PGDATA=/var/lib/pgsql/10/data
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
 LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-06-01 17:41:27.478 EDT [15209] DEBUG:  _=/usr/pgsql-10/bin/postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
 PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
 PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_COLLATE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_CTYPE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_MESSAGES=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_MONETARY=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_NUMERIC=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_TIME=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
 -
2019-06-01 17:41:27.494 EDT [15209] DEBUG:  registering background worker
"logical replication launcher"
2019-06-01 17:41:27.495 EDT [15209] LOG:  listening on IPv4 address
"192.168.0.118", port 5432
2019-06-01 17:41:27.498 EDT [15209] LOG:  listening on Unix socket
"./.s.PGSQL.5432"
2019-06-01 17:41:27.499 EDT [15209] DEBUG:  invoking
IpcMemoryCreate(size=148545536)
2019-06-01 17:41:27.499 EDT [15209] DEBUG:  mmap(148897792) with
MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-01 17:41:27.514 EDT [15209] DEBUG:  SlruScanDirectory invoking
callback on pg_notify/
2019-0

Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 6:05 PM Tom K  wrote:

>
>
> On Sat, Jun 1, 2019 at 5:51 PM Adrian Klaver 
> wrote:
>
>> On 6/1/19 2:31 PM, Tom K wrote:
>> >
>> >
>>
>> >
>> > Spoke too soon.  There's no tables when it's started without the
>> > recovery.conf file.
>>
>> Were there any errors in the start up?
>>
>
> Nothing I would discern as a startup error:
>
> [root@psql03 patroni]# sudo su - postgres
> Last login: Sat Jun  1 17:30:31 EDT 2019 on pts/1
> -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
> --config-file=/data/patroni/postgresql.conf
> --listen_addresses=192.168.0.118 --max_worker_processes=8
> --max_locks_per_transaction=64 --wal_level=replica
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
> --max_replication_slots=10 --max_connections=100 --hot_standby=on
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  postgres: PostmasterMain:
> initial environment dump:
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:
>  -
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  XDG_SESSION_ID=219
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  HOSTNAME=
> psql03.nix.mds.xyz
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  SHELL=/bin/bash
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  TERM=xterm
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  HISTSIZE=1000
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  USER=postgres
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:
>  
> LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:
>  MAIL=/var/spool/mail/postgres
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:
>  
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  PWD=/var/lib/pgsql
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  LANG=en_US.UTF-8
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  HISTCONTROL=ignoredups
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  SHLVL=1
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  HOME=/var/lib/pgsql
> 2019-06-01 17:41:27.477 EDT [15209] DEBUG:  LOGNAME=postgres
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:
>  PGDATA=/var/lib/pgsql/10/data
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:
>  LESSOPEN=||/usr/bin/lesspipe.sh %s
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:
>  _=/usr/pgsql-10/bin/postgres
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:
>  PGLOCALEDIR=/usr/pgsql-10/share/locale
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:
>  PGSYSCONFDIR=/etc/sysconfig/pgsql
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_COLLATE=en_US.UTF-8
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_CTYPE=en_US.UTF-8
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_MESSAGES=en_US.UTF-8
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_MONETARY=C
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_NUMERIC=C
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:  LC_TIME=C
> 2019-06-01 17:41:27.478 EDT [15209] DEBUG:
>  -
> 2019-06-01 17:41:27.494 EDT [15209] DEBUG:  registering background worker
> "logical replication launcher"
> 2019-06-01 17:41:27.495 EDT [15209] LOG:  listening on IPv4 address
> "192.168.0.118", port 5432
> 2019-06-

Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 6:39 PM Adrian Klaver 
wrote:

> On 6/1/19 3:14 PM, Tom K wrote:
>
> >
> > ** Correction.  There is postgres, template1 and template2 but none of
> > the other databases we had.
>
> Just noticed, is that really template2 or is it actually template0?


Apologies.  You're right, that was a typo:

-bash-4.2$ psql -h psql03 -U postgres
Password for user postgres:
psql (10.8)
Type "help" for help.

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(3 rows)

postgres=#



>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 6:36 PM Adrian Klaver 
wrote:

> On 6/1/19 3:14 PM, Tom K wrote:
> >
> >
>
> > ** Correction.  There is postgres, template1 and template2 but none of
> > the other databases we had.
>
> In a psql session do:
>
> select oid, datname from pg_database ;
>
> Then go to /data/patroni an drill down to the base directory. In that
> directory there should be sub-directories with numbers equal to the oid
> values from the query. If not then something is wrong. That could be:
>
> 1) Wrong data directory.
>
> 2) Copied over incorrect data directory.
>
> Basically if the only sub-directories are for the system databases then
> the user databases do not exist. You might want to look at some of the
> other clusters to see what is there.
>
>
>

-bash-4.2$ pwd
/var/lib/pgsql
-bash-4.2$ psql -h psql03 -U postgres
Password for user postgres:
psql (10.8)
Type "help" for help.

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(3 rows)

postgres=# select oid, datname from pg_database ;
  oid  |  datname
---+---
 13806 | postgres
 1 | template1
 13805 | template0
(3 rows)

postgres=#
postgres=# cd /data/patroni
postgres-# ls -altri
postgres-# pwd
postgres-# pwd
postgres-# ls -altrid
postgres-# ;
ERROR:  syntax error at or near "cd"
LINE 1: cd /data/patroni
^
postgres=# \q
-bash-4.2$ ls -altri
total 652
134299846 drwxr-xr-x. 39 root     root       4096 Oct 24  2018 ..
 67178653 -rw-r--r--.  1 postgres postgres     30 Oct 24  2018 .pgsql_profile
134734937 drwx------.  4 postgres postgres     31 May  8 06:25 10
 67178637 drwx------.  3 postgres postgres   4096 May  8 06:25 .
 68994450 -rw-r--r--.  1 postgres postgres 204336 May 26 22:37 s1.log
 67677594 -rwx------.  1 postgres postgres    266 May 26 23:06 .bash_profile
 68994451 -rw-r--r--.  1 postgres postgres 433920 May 26 23:44 s2.log
 68391913 -rw-------.  1 postgres postgres   4716 Jun  1 17:34 .bash_history
 67178638 -rw-------.  1 postgres postgres    325 Jun  1 18:54 .psql_history
-bash-4.2$ cd /data/patroni/
-bash-4.2$ ls -altri
total 144
 69085037 drwxr-xr-x.  3 root     root        20 Oct 23  2018 ..
135316997 -rw-------.  1 postgres postgres   206 Oct 29  2018 backup_label.old
201708781 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_commit_ts
  1502746 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_dynshmem
 68994449 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_twophase
  1502749 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_snapshots
201708785 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_serial
  1502747 drwx------.  4 postgres postgres    34 Oct 29  2018 pg_multixact
 67677559 drwx------.  5 postgres postgres    38 Oct 29  2018 base
  1502751 drwx------.  2 postgres postgres    17 Oct 29  2018 pg_xact
  1502750 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_tblspc
134668882 -rw-------.  1 postgres postgres    88 Oct 29  2018 postgresql.auto.conf
134500181 -rw-------.  1 postgres postgres     3 Oct 29  2018 PG_VERSION
134500173 -rw-------.  1 postgres postgres  1636 Oct 29  2018 pg_ident.conf
134500171 -rw-------.  1 postgres postgres  4730 Oct 29  2018 pg_hba.conf
134668883 -rw-------.  1 postgres postgres 22783 Oct 29  2018 postgresql.base.conf
134500170 -rw-------.  1 postgres postgres   124 Oct 29  2018 patroni.dynamic.json
134668889 -rw-r--r--.  1 postgres postgres    90 Oct 29  2018 recovery.done
201708786 drwx------.  2 postgres postgres    17 Oct 29  2018 pg_subtrans
134500162 drwx------.  2 postgres postgres  4096 Nov  4  2018 log
134500182 drwx------.  3 postgres postgres  8192 Apr 28 06:06 pg_wal
  1502748 drwx------.  4 postgres postgres    42 May 22 01:55 pg_replslot
134668893 -rw-r--r--.  1 postgres postgres    90 May 26 22:34 t.txt
134668885 -rw-r--r--.  1 postgres postgres   575 May 26 22:34 postgresql.conf
134668886 -rw-r--r--.  1 postgres postgres   575 May 26 22:34 postgresql.conf.backup
134668884 -rw-------.  1 postgres postgres 22783 May 26 22:34 postgresql.base.conf.backup
134500172 -rw-------.  1 postgres postgres  4730 May 26 22:34 pg_hba.conf.backup
134500174 -rw-------.  1 postgres postgres  1636 May 26 22:34 pg_ident.conf.backup
 68994437 drwx------.  4 postgres postgres    65 Jun  1 17:34 pg_logical
134734900 -rw-------.  1 postgres postgres    38 

Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 7:12 PM Adrian Klaver 
wrote:

> On 6/1/19 3:56 PM, Tom K wrote:
> >
> >
>
> >
> > postgres=# select oid, datname from pg_database ;
> >oid  |  datname
> > ---+---
> >   13806 | postgres
> >   1 | template1
> >   13805 | template0
> > (3 rows)
> >
>
> So there are only the system databases available
>
> > -bash-4.2$ cd /data/patroni/
> > -bash-4.2$ ls -altri
> > total 144
> >  69085037 drwxr-xr-x.  3 root     root       20 Oct 23  2018 ..
> > 135316997 -rw-------.  1 postgres postgres  206 Oct 29  2018 backup_label.old
> > 201708781 drwx------.  2 postgres postgres    6 Oct 29  2018 pg_commit_ts
> >   1502746 drwx------.  2 postgres postgres    6 Oct 29  2018 pg_dynshmem
> >  68994449 drwx------.  2 postgres postgres    6 Oct 29  2018 pg_twophase
> >   1502749 drwx------.  2 postgres postgres    6 Oct 29  2018 pg_snapshots
> > 201708785 drwx------.  2 postgres postgres    6 Oct 29  2018 pg_serial
> >   1502747 drwx------.  4 postgres postgres   34 Oct 29  2018 pg_multixact
> >  67677559 drwx------.  5 postgres postgres   38 Oct 29  2018 base
>
> base/ is the directory you need to look in. I'm guessing it is only
> going to show the oid/ for the three db's above and  pgsql_tmp/
>
> For more info on this see:
> https://www.postgresql.org/docs/10/storage-file-layout.html
>
> The below looks like the RH package installed data directory. Also looks
> like it either has never had initdb run against or the files where removed.
>
> I thought you said you had copied in data directories from the other
> nodes, did I remember correctly?
>
> > -bash-4.2$ cd
> > -bash-4.2$ cd 10
> > -bash-4.2$ ls
> > backups  data
> > -bash-4.2$ pwd
> > /var/lib/pgsql/10
> > -bash-4.2$ cd data/
> > -bash-4.2$ ls -altri
> > total 0
> > 134734937 drwx------. 4 postgres postgres 31 May  8 06:25 ..
> >    245519 drwx------. 2 postgres postgres  6 May  8 06:25 .
> > -bash-4.2$ cd ..
> > -bash-4.2$ pwd
> > /var/lib/pgsql/10
> > -bash-4.2$ cd ..
> > -bash-4.2$
>
>
Yep, you remembered correctly.

I copied the files as they were out to a temporary folder under root on
each node, but never dug into base/ etc. any further to check things.  So
here's the state of things in the base/ folder of each node's backup.

[ PSQL03 ]
[root@psql03 base]# ls -altri
total 40
    42424 drwx------.  2 postgres postgres 8192 Oct 29  2018 1
 67714749 drwx------.  2 postgres postgres 8192 Oct 29  2018 13805
202037206 drwx------.  5 postgres postgres   38 Oct 29  2018 .
134312175 drwx------.  2 postgres postgres 8192 May 22 01:55 13806
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 base]#



[ PSQL02 ]
[root@psql02 base]# ls -altri
total 412
201426668 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
   743426 drwx------.  2 postgres postgres  8192 Mar 24 03:47 13805
135326327 drwx------.  2 postgres postgres 16384 Mar 24 20:15 40970
   451699 drwx------.  2 postgres postgres 40960 Mar 25 19:47 16395
  1441696 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131137
 68396137 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131138
135671065 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131139
204353100 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131140
135326320 drwx------. 17 postgres postgres  4096 Apr 14 10:08 .
 68574415 drwx------.  2 postgres postgres 12288 Apr 28 06:06 131142
   288896 drwx------.  2 postgres postgres 16384 Apr 28 06:06 131141
203015232 drwx------.  2 postgres postgres  8192 Apr 28 06:06 131136
135326328 drwx------.  2 postgres postgres 40960 May  5 22:09 24586
 67282461 drwx------.  2 postgres postgres  8192 May  5 22:09 13806
 67640961 drwx------.  2 postgres postgres 20480 May  5 22:09 131134
203500274 drwx------.  2 postgres postgres 16384 May  5 22:09 155710
134438257 drwxr-xr-x. 20 root     root      4096 May 22 01:44 ..
[root@psql02 base]# pwd
/root/postgres-patroni-backup/base
[root@psql02 base]#


[ PSQL01 ]
[root@psql01 base]# ls -altri
total 148
134704615 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
201547700 drwx------.  2 postgres postgres  8192 Oct 29  2018 13805
   160398 drwx------.  2 postgres postgres  8192 Feb 24 23:53 13806
 67482137 drwx------.  7 postgres postgres    62 Feb 24 23:54 .
135909671 drwx------.  2 postgres postgres 24576 Feb 24 23:54 24586
    13555 drwx------.  2 postgres postgres 24576 Feb 24 23:54 16395
 67178716 drwxr-xr-x. 20 root     root      4096 May 22 01:53 ..
[root@psql01 base]# pwd
/root/postgresql-patroni-etcd/base
[root@psql01 base]#

Looks like this crash was far more catastrophic than I thought.  By the
looks of things, focusing on psql02 would be my best bet.



> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver 
wrote:

> On 6/1/19 4:22 PM, Tom K wrote:
> >
> >
>
> >
> > Looks like this crash was far more catastrophic than I thought.  By the
> > looks of things, focusing on psql02 would be my best bet.
> >
>
> The more I look at it the more I think the replication was not doing
> what you thought it was doing. That psql02 was the primary and that
> psql01 and psql03 where out of sync and/or defunct standbys.
>

Now that I look at the files, that's the conclusion I was coming to
myself.  Sample config:

[root@psql02 base]# cat /etc/patroni.yml
scope: postgres
namespace: /db/
name: postgresql1

restapi:
listen: 192.168.0.124:8008
connect_address: 192.168.0.124:8008

etcd:
host: 192.168.0.124:2379

bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
postgresql:
use_pg_rewind: true

initdb:
- encoding: UTF8
- data-checksums

pg_hba:
- host replication replicator 127.0.0.1/32 md5
- host replication replicator 192.168.0.108/0 md5
- host replication replicator 192.168.0.124/0 md5
- host replication replicator 192.168.0.118/0 md5
- host all all 0.0.0.0/0 md5

users:
admin:
password: admin
options:
- createrole
- createdb

postgresql:
listen: 192.168.0.124:5432
bin_dir: /usr/pgsql-10/bin
connect_address: 192.168.0.124:5432
data_dir: /data/patroni
pgpass: /tmp/pgpass
unix_socket_directories: /data/patroni
authentication:
replication:
username: replicator
password: rep-pass
superuser:
username: postgres
password: 
parameters:
unix_socket_directories: '.'

tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
[root@psql02 base]#

Or perhaps when the system crashed, the filesystem check simply moved the
folders out due to corruption.
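
If the filesystem check really did relocate anything, it should have ended
up in lost+found on that filesystem, so a quick sanity check, assuming /data
is its own mount, would be something like:

ls -l /data/lost+found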


>
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
So the best bet will be trying to get through this error then:

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550
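
From what I can tell that message comes from reading the replication origin
state file, pg_logical/replorigin_checkpoint, in the data directory; "magic
0" would mean the first bytes of that file read back as zero, i.e. the file
is empty, zero-filled or truncated.  Something like this should show its
state (paths assume the psql02 backup copy):

ls -l /root/postgres-patroni-backup/pg_logical/
hexdump -C /root/postgres-patroni-backup/pg_logical/replorigin_checkpoint | head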




On Sat, Jun 1, 2019 at 8:21 PM Tom K  wrote:

>
>
> On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver 
> wrote:
>
>> On 6/1/19 4:22 PM, Tom K wrote:
>> >
>> >
>>
>> >
>> > Looks like this crash was far more catastrophic than I thought.  By the
>> > looks of things, focusing on psql02 would be my best bet.
>> >
>>
>> The more I look at it the more I think the replication was not doing
>> what you thought it was doing. That psql02 was the primary and that
>> psql01 and psql03 where out of sync and/or defunct standbys.
>>
>
> Now that I look at the files myself, that's the conclusion I was coming to
> myself.  Sample config:
>
> [root@psql02 base]# cat /etc/patroni.yml
> scope: postgres
> namespace: /db/
> name: postgresql1
>
> restapi:
> listen: 192.168.0.124:8008
> connect_address: 192.168.0.124:8008
>
> etcd:
> host: 192.168.0.124:2379
>
> bootstrap:
> dcs:
> ttl: 30
> loop_wait: 10
> retry_timeout: 10
> maximum_lag_on_failover: 1048576
> postgresql:
> use_pg_rewind: true
>
> initdb:
> - encoding: UTF8
> - data-checksums
>
> pg_hba:
> - host replication replicator 127.0.0.1/32 md5
> - host replication replicator 192.168.0.108/0 md5
> - host replication replicator 192.168.0.124/0 md5
> - host replication replicator 192.168.0.118/0 md5
> - host all all 0.0.0.0/0 md5
>
> users:
> admin:
> password: admin
> options:
> - createrole
> - createdb
>
> postgresql:
> listen: 192.168.0.124:5432
> bin_dir: /usr/pgsql-10/bin
> connect_address: 192.168.0.124:5432
> data_dir: /data/patroni
> pgpass: /tmp/pgpass
> unix_socket_directories: /data/patroni
> authentication:
> replication:
> username: replicator
> password: rep-pass
> superuser:
> username: postgres
> password: 
> parameters:
> unix_socket_directories: '.'
>
> tags:
> nofailover: false
> noloadbalance: false
> clonefrom: false
> nosync: false
> [root@psql02 base]#
>
> Or perhaps when the system crashed, the filesystem check simply moved the
> folders out due to corruption.
>
>
>>
>>
>>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com
>>
>


Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 8:25 PM Tom K  wrote:

> So the best bet will be trying to get through this error then:
>
> [ PSQL02 ]
> PANIC: replication checkpoint has wrong magic 0 instead of 307747550
>
>
>
>
> On Sat, Jun 1, 2019 at 8:21 PM Tom K  wrote:
>
>>
>>
>> On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver 
>> wrote:
>>
>>> On 6/1/19 4:22 PM, Tom K wrote:
>>> >
>>> >
>>>
>>> >
>>> > Looks like this crash was far more catastrophic than I thought.  By
>>> the
>>> > looks of things, focusing on psql02 would be my best bet.
>>> >
>>>
>>> The more I look at it the more I think the replication was not doing
>>> what you thought it was doing. That psql02 was the primary and that
>>> psql01 and psql03 where out of sync and/or defunct standbys.
>>>
>>
>> Now that I look at the files myself, that's the conclusion I was coming
>> to myself.  Sample config:
>>
>> [root@psql02 base]# cat /etc/patroni.yml
>> scope: postgres
>> namespace: /db/
>> name: postgresql1
>>
>> restapi:
>> listen: 192.168.0.124:8008
>> connect_address: 192.168.0.124:8008
>>
>> etcd:
>> host: 192.168.0.124:2379
>>
>> bootstrap:
>> dcs:
>> ttl: 30
>> loop_wait: 10
>> retry_timeout: 10
>> maximum_lag_on_failover: 1048576
>> postgresql:
>> use_pg_rewind: true
>>
>> initdb:
>> - encoding: UTF8
>> - data-checksums
>>
>> pg_hba:
>> - host replication replicator 127.0.0.1/32 md5
>> - host replication replicator 192.168.0.108/0 md5
>> - host replication replicator 192.168.0.124/0 md5
>> - host replication replicator 192.168.0.118/0 md5
>> - host all all 0.0.0.0/0 md5
>>
>> users:
>> admin:
>> password: admin
>> options:
>> - createrole
>> - createdb
>>
>> postgresql:
>> listen: 192.168.0.124:5432
>> bin_dir: /usr/pgsql-10/bin
>> connect_address: 192.168.0.124:5432
>> data_dir: /data/patroni
>> pgpass: /tmp/pgpass
>> unix_socket_directories: /data/patroni
>> authentication:
>> replication:
>> username: replicator
>> password: rep-pass
>> superuser:
>> username: postgres
>> password: 
>> parameters:
>> unix_socket_directories: '.'
>>
>> tags:
>> nofailover: false
>> noloadbalance: false
>> clonefrom: false
>> nosync: false
>> [root@psql02 base]#
>>
>> Or perhaps when the system crashed, the filesystem check simply moved the
>> folders out due to corruption.
>>
>>
>>>
>>>
Trying what we did above but on the second node:

2019-06-01 20:28:45.305 EDT [21745] LOG:  database system was interrupted;
last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  primary_slot_name =
'postgresql1'
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  standby_mode = 'on'
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  recovery_target_timeline =
latest
2019-06-01 20:28:46.142 EDT [21745] WARNING:  recovery command file
"recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 20:28:46.142 EDT [21745] HINT:  The database server will
regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 20:28:46.142 EDT [21745] LOG:  entering standby mode
2019-06-01 20:28:46.142 EDT [21745] LOG:  invalid record length at
0/4C35CDF8: wanted 24, got 0
2019-06-01 20:28:46.142 EDT [21745] LOG:  invalid primary checkpoint record
2019-06-01 20:28:46.142 EDT [21745] LOG:  using previous checkpoint record
at 0/4C34EDA8
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  redo record is at 0/4C34ED70;
shutdown FALSE
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  next transaction ID: 0:1409831;
next OID: 237578
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  next MultiXactId: 48; next
MultiXactOffset: 174
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  oldest unfrozen transaction ID:
549, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  oldest MultiXactId: 1, in
database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  commit timestamp Xid
oldest/newest: 0/0
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  transaction ID wrap limit is
2147484196, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  MultiXactId wrap limit is
2147483648, limited by database with OID

Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
ect as any PostgreSQL user, including
# the database superuser.  If you do not trust all your local users,
# use another authentication method.


# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust

host replication replicator 127.0.0.1/32 md5
host replication replicator 192.168.0.108/0 md5
host replication replicator 192.168.0.124/0 md5
host replication replicator 192.168.0.118/0 md5
host all all 0.0.0.0/0 md5
-bash-4.2$ pwd
/data/patroni
-bash-4.2$

-bash-4.2$ cat postgresql.base.conf|grep -v "#"|sed -e "/^$/d"
log_timezone = 'Canada/Eastern'
datestyle = 'iso, mdy'
timezone = 'Canada/Eastern'
default_text_search_config = 'pg_catalog.english'
-bash-4.2$


On Sat, Jun 1, 2019 at 8:36 PM Adrian Klaver 
wrote:

> On 6/1/19 5:21 PM, Tom K wrote:
> >
> >
> > On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver  wrote:
> >
> > On 6/1/19 4:22 PM, Tom K wrote:
> >  >
> >  >
> >
> >  >
> >  > Looks like this crash was far more catastrophic than I thought.
> > By the
> >  > looks of things, focusing on psql02 would be my best bet.
> >  >
> >
> > The more I look at it the more I think the replication was not doing
> > what you thought it was doing. That psql02 was the primary and that
> > psql01 and psql03 where out of sync and/or defunct standbys.
> >
> >
> > Now that I look at the files myself, that's the conclusion I was coming
> > to myself.  Sample config:
>
> The below would be for someone that uses and understands Patroni. That
> would not be me:)
>
> >
> > [root@psql02 base]# cat /etc/patroni.yml
> > scope: postgres
> > namespace: /db/
> > name: postgresql1
> >
> > restapi:
> >  listen: 192.168.0.124:8008
> >  connect_address: 192.168.0.124:8008
>
> >
> > Or perhaps when the system crashed, the filesystem check simply moved
> > the folders out due to corruption.
>
> That would leave the cluster in an inconsistent state and you would not
> have been able to start the one you got going.
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: psql: FATAL: the database system is starting up

2019-06-01 Thread Tom K
On Sat, Jun 1, 2019 at 8:53 PM Adrian Klaver 
wrote:

> On 6/1/19 5:32 PM, Tom K wrote:
>
> >
> >
> > Trying what we did above but on the second node:
> >
>
> Was this node the primary?
>
> To me the below looks like there are replication slots set up that are
> failing. Not sure how to deal with this at the moment. You might try
> single-user mode:
>
> https://www.postgresql.org/docs/10/app-postgres.html
> Single-User Mode
> ...
>
> and see if that at least gets the server started. This is a highly
> restricted so do not expect much usability.
>

These servers did crash before, however I didn't notice anything
particularly wrong at the time.  Perhaps a better check was in order.

Here is the result from attempting to start the DB in single user mode.

-bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=off
--listen_addresses=192.168.0.124 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres
--wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off
--max_prepared_transactions=0 --port=5432 --max_replication_slots=10
--max_connections=20 -d 5 2>&1
2019-06-01 23:04:20.876 EDT [31831] DEBUG:  invoking
IpcMemoryCreate(size=144687104)
2019-06-01 23:04:20.877 EDT [31831] DEBUG:  mmap(144703488) with
MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-01 23:04:20.890 EDT [31831] DEBUG:  SlruScanDirectory invoking
callback on pg_notify/
2019-06-01 23:04:20.890 EDT [31831] DEBUG:  removing file "pg_notify/"
2019-06-01 23:04:20.890 EDT [31831] DEBUG:  dynamic shared memory system
will support 128 segments
2019-06-01 23:04:20.891 EDT [31831] DEBUG:  created dynamic shared memory
control segment 846930886 (3088 bytes)
2019-06-01 23:04:20.891 EDT [31831] DEBUG:  InitPostgres
2019-06-01 23:04:20.891 EDT [31831] DEBUG:  my backend ID is 1
2019-06-01 23:04:20.891 EDT [31831] LOG:  database system was interrupted;
last known up at 2019-04-28 06:06:24 EDT
2019-06-01 23:04:21.239 EDT [31831] LOG:  invalid record length at
0/4C35CDF8: wanted 24, got 0
2019-06-01 23:04:21.239 EDT [31831] LOG:  invalid primary checkpoint record
2019-06-01 23:04:21.239 EDT [31831] LOG:  using previous checkpoint record
at 0/4C34EDA8
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  redo record is at 0/4C34ED70;
shutdown FALSE
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  next transaction ID: 0:1409831;
next OID: 237578
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  next MultiXactId: 48; next
MultiXactOffset: 174
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  oldest unfrozen transaction ID:
549, in database 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  oldest MultiXactId: 1, in
database 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  commit timestamp Xid
oldest/newest: 0/0
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  transaction ID wrap limit is
2147484196, limited by database with OID 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  MultiXactId wrap limit is
2147483648, limited by database with OID 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  starting up replication slots
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  starting up replication origin
progress state
2019-06-01 23:04:21.239 EDT [31831] PANIC:  replication checkpoint has
wrong magic 0 instead of 307747550
Aborted
-bash-4.2$
-bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni
--config-file=/data/patroni/postgresql.conf
2019-06-01 23:06:28.679 EDT [31969] LOG:  database system was interrupted;
last known up at 2019-04-28 06:06:24 EDT
2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid record length at
0/4C35CDF8: wanted 24, got 0
2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid primary checkpoint record
2019-06-01 23:06:28.953 EDT [31969] LOG:  using previous checkpoint record
at 0/4C34EDA8
2019-06-01 23:06:28.953 EDT [31969] PANIC:  replication checkpoint has
wrong magic 0 instead of 307747550
Aborted
-bash-4.2$



>
> > Removing the recovery.conf and restarting yields the following messages:
> >
> > 2019-06-01 20:31:12.231 EDT [21910] LOG:  database system was
> > interrupted; last known up at 2019-04-28 06:06:24 EDT
> > 2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid record length at
> > 0/4C35CDF8: wanted 24, got 0
> > 2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid primary checkpoint
> record
> > 2019-06-01 20:31:12.505 EDT [21910] LOG:  using previous checkpoint
> > record at 0/4C34EDA8
> > 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  redo record is at
> > 0/4C34ED70; shutdown FALSE
> > 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next transaction ID:
> > 0:1409831; next OID: 237578
> > 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next MultiXactId: 48; next
> > MultiXactOffset: 174
> > 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  oldest unfr

Re: psql: FATAL: the database system is starting up

2019-06-02 Thread Tom K
On Sun, Jun 2, 2019 at 11:47 AM Adrian Klaver 
wrote:

> On 6/1/19 8:07 PM, Tom K wrote:
>
> >
> > https://www.postgresql.org/docs/10/app-postgres.html
> > Single-User Mode
> > ...
> >
> > and see if that at least gets the server started. This is a highly
> > restricted mode, so do not expect much usability.
> >
> >
> > These servers did crash before; however, I didn't notice anything
> > particularly wrong at the time. Perhaps a better check was in order.
> >
> > Here is the result from attempting to start the DB in single user mode.
>
> > -bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni
> > --config-file=/data/patroni/postgresql.conf
> > 2019-06-01 23:06:28.679 EDT [31969] LOG:  database system was
> > interrupted; last known up at 2019-04-28 06:06:24 EDT
> > 2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid record length at
> > 0/4C35CDF8: wanted 24, got 0
> > 2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid primary checkpoint
> record
> > 2019-06-01 23:06:28.953 EDT [31969] LOG:  using previous checkpoint
> > record at 0/4C34EDA8
> > 2019-06-01 23:06:28.953 EDT [31969] PANIC:  replication checkpoint has
> > wrong magic 0 instead of 307747550
>
> That comes from origin.c in the logical replication section of the code.
>
> Did you have logical replication set up?
>

Nope. wal_level was set to replica, not logical.  Unless you mean


> What was the role of this cluster in the original setup?
>

The cluster was the backend database for a number of applications.  The aim
was to point applications to a single large cluster instead of a number of
small instances of postgres running all over the LAB.

If I can get past the above error and reach the DBs and tables, I could
just dump them out, reinit the entire DB, and restore from that dump.
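
Roughly, the dump-and-reinit path I have in mind looks like the sketch
below. This assumes the damaged instance can be brought up far enough to
accept connections; the dump file name, the new data directory and the
temporary port are made up for the example:

# Dump every database (globals included) into one SQL file:
/usr/pgsql-10/bin/pg_dumpall -h 127.0.0.1 -p 5432 -U postgres -f /var/tmp/alldbs.sql

# Init a fresh cluster in a new directory and load the dump back in:
/usr/pgsql-10/bin/initdb -D /data/patroni.new
/usr/pgsql-10/bin/pg_ctl -D /data/patroni.new -o "-p 5433" -l /var/tmp/pg.log start
/usr/pgsql-10/bin/psql -h 127.0.0.1 -p 5433 -U postgres -d postgres -f /var/tmp/alldbs.sql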


>
> > Aborted
> > -bash-4.2$
> >
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: psql: FATAL: the database system is starting up

2019-06-02 Thread Tom K
Hey Adrian,

Fixed it.

I saw the post from jebriggs but that didn't work for me, so I posted
here. Anyway, here's how I resolved it:

When I ran strace on the postgres startup command, I got this:

open("pg_logical/replorigin_checkpoint", O_RDONLY) = 6
write(2, "2019-06-02 14:50:34.777 EDT [283"..., 1062019-06-02 14:50:34.777
EDT [28362] PANIC:  replication checkpoint has wrong magic 0 instead of
307747550
-bash-4.2$

Notice how it's reading *pg_logical/replorigin_checkpoint* even though
wal_level = 'replica'.  It sure looks like it's doing logical
replication.  Anyway, I checked the file and saw this:

-bash-4.2$ ls -altri pg_logical/
total 20
 67894871 -rw-------.  1 postgres postgres 16384 Oct 29  2018 replorigin_checkpoint
136946383 drwx------.  2 postgres postgres     6 Oct 29  2018 snapshots
204367784 drwx------.  2 postgres postgres     6 Oct 29  2018 mappings
 67894870 drwx------.  4 postgres postgres    65 Apr 28 06:06 .
135326272 drwx------. 21 postgres postgres  4096 Jun  2 14:50 ..
-bash-4.2$

Notice the dates and the size: all way off. Next I checked the same file
on the other nodes:

[root@psql03 pg_logical]# ls -altri
total 8
 68994432 drwx------.  2 postgres postgres    6 Oct 29  2018 snapshots
134984156 drwx------.  2 postgres postgres    6 Oct 29  2018 mappings
   566745 -rw-------.  1 postgres postgres    8 May 22 01:55 replorigin_checkpoint
   566731 drwx------.  4 postgres postgres   65 May 22 01:55 .
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 pg_logical]#

The other hosts' replorigin_checkpoint files are much more recent and much
smaller.
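
(For anyone hitting the same PANIC: a quick way to check which copy of the
file still has a sane header is to read its first 4 bytes as an unsigned
32-bit integer and compare against the 307747550 from the error message.
Just a sketch, assuming you run it on the same little-endian host, like
these x86_64 VMs:)

# A healthy file should print 307747550 (0x1257DADE); the corrupt one prints 0.
od -A d -t u4 -N 4 /data/patroni/pg_logical/replorigin_checkpoint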

So I took the replorigin_checkpoint from psql03 and copied it over to the
former primary node, psql02, then tried a restart.

Everything started up. Replication to the other postgres nodes began
without issues. I verified the base/ folder on all nodes to confirm
replication was working. Every table is back to the way it was before this
entire outage.
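
In case it helps someone else, the fix boiled down to roughly the commands
below, run on psql02 as root (hostnames and paths as in this thread; the
restart line is an assumption, so start the DB however you normally do,
which in my case is Patroni):

# Keep the corrupt file around, pull the good copy from psql03, fix
# ownership/permissions, then restart:
mv /data/patroni/pg_logical/replorigin_checkpoint /data/patroni/pg_logical/replorigin_checkpoint.bad
scp root@psql03:/data/patroni/pg_logical/replorigin_checkpoint /data/patroni/pg_logical/
chown postgres:postgres /data/patroni/pg_logical/replorigin_checkpoint
chmod 600 /data/patroni/pg_logical/replorigin_checkpoint
systemctl restart patroni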

Adrian, thank you very much for all the help here.  It was much
appreciated.  I've written this up and posted it at the following two links
so others can benefit and I don't forget.

https://microdevsys.com/wp/panic-replication-checkpoint-has-wrong-magic-0-instead-of-307747550/


And of course a handy way to do backups, which I should have implemented
before this happened:

https://microdevsys.com/wp/postgresql-pull-backup/
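
For reference, a minimal pull-style backup can be as simple as running
pg_basebackup from the backup host; the host, user and target path below
are just placeholders (the write-up above has the real setup), and
pg_basebackup will prompt for the replicator password unless it's in
~/.pgpass:

/usr/pgsql-10/bin/pg_basebackup -h 192.168.0.124 -p 5432 -U replicator \
  -D /data/postgres-backup/$(date +%F) -Ft -z -X stream -P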

Thx,
TK

On Sun, Jun 2, 2019 at 4:48 PM Adrian Klaver 
wrote:

> On 6/2/19 11:14 AM, Tom K wrote:
>
> >
> > Nope. wal_level was set to replica, not logical.  Unless you mean
> >
> >
> > What was the role of this cluster in the original setup?
> >
> >
> > The cluster was the backend database for a number of applications.  The
> > aim was to point applications to a single large cluster instead of a
> > number of small instances of postgres running all over the LAB.
> >
> > If I can get past the above error and reach the DBs and tables, I could
> > just dump them out, reinit the entire DB, and restore from that dump.
>
> The only thing I could find is the below:
>
>
> http://www.jebriggs.com/blog/2018/05/postgresql-and-panic-replication-checkpoint-has-wrong-magic-error/
>
> If that does not work I would suggest sending a new post(new subject) to
> the mailing list based on:
>
> replication checkpoint has wrong magic 0 instead of 307747550
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>
>
>