Correct setup for 3+ node cluster with streaming replication and WAL archiving

2019-02-14 Thread Andre Piwoni
hive/00010005
2019-02-14 11:13:19.199 PST [1579] DEBUG:  attempting to remove WAL
segments older than log file 0006
2019-02-14 11:13:19.200 PST [1579] DEBUG:  recycled write-ahead log file
"00010006"
2019-02-14 11:13:20.201 PST [2346] LOG:  archive command failed with exit
code 1

OTHER SLAVE
2019-02-14 11:13:14.723 PST [5807] FATAL:  could not connect to the primary
server: could not connect to server: Connection refused
Is the server running on host "pg-hdp-node1.kitchen.local"
(172.28.128.54) and accepting
TCP/IP connections on port 5432?
2019-02-14 11:13:16.072 PST [1839] DEBUG:  performing replication slot
checkpoint
2019-02-14 11:13:16.077 PST [1839] DEBUG:  attempting to remove WAL
segments older than log file 0005
2019-02-14 11:13:16.078 PST [1839] DEBUG:  recycled write-ahead log file
"00010005"
2019-02-14 11:13:19.412 PST [1838] DEBUG:  switched WAL source from stream
to archive after failure
2019-02-14 11:13:19.412 PST [1838] DEBUG:  invalid resource manager ID 64
at 0/798
2019-02-14 11:13:19.412 PST [1838] DEBUG:  switched WAL source from archive
to stream after failure
2019-02-14 11:13:24.418 PST [1839] LOG:  shutting down
2019-02-14 11:13:24.427 PST [1834] LOG:  database system is shut down
2019-02-14 11:13:25.096 PST [5839] LOG:  database system was shut down in
recovery at 2019-02-14 11:13:24 PST
2019-02-14 11:13:25.096 PST [5839] DEBUG:  standby_mode = 'on'
2019-02-14 11:13:25.096 PST [5839] DEBUG:  primary_conninfo =
'user=replication password=password host=pg-hdp-node3.kitchen.local
port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres
target_session_attrs=any'
2019-02-14 11:13:25.096 PST [5839] DEBUG:  restore_command = 'cp
/mnt/pgsql/archive/%f %p'
2019-02-14 11:13:25.096 PST [5839] LOG:  entering standby mode
cp: cannot stat ‘/mnt/pgsql/archive/00010007’: No such file
or directory
2019-02-14 11:13:25.102 PST [5839] DEBUG:  could not restore file
"00010007" from archive: child process exited with exit
code 1
2019-02-14 11:13:25.149 PST [5839] DEBUG:  switched WAL source from stream
to archive after failure
cp: cannot stat ‘/mnt/pgsql/archive/00010007’: No such file
or directory
2019-02-14 11:13:25.154 PST [5839] DEBUG:  could not restore file
"00010007" from archive: child process exited with exit
code 1
2019-02-14 11:13:25.154 PST [5839] DEBUG:  invalid resource manager ID 64
at 0/798

*Andre Piwoni*


Promoted slave tries to archive previously archived WAL file

2019-02-15 Thread Andre Piwoni
I have master and slave running with the following contents of their pg_wal
directories and archivedir:

ls -l /mnt/pgsql/archive/
-rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 00010001
-rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 00010002
-rw-rw-rw-. 1 root root  302 Feb 15 09:39
00010002.0028.backup

pg-hdp-node1.kitchen.local
/var/lib/pgsql/10/data/pg_wal/:
-rw---. 1 postgres postgres 16777216 Feb 15 09:39
00010002
-rw---. 1 postgres postgres  302 Feb 15 09:39
00010002.0028.backup
-rw---. 1 postgres postgres 16777216 Feb 15 09:44
00010003
-rw---. 1 postgres postgres 16777216 Feb 15 09:39
00010004
drwx--. 2 postgres postgres   96 Feb 15 09:44 archive_status
/var/lib/pgsql/10/data/pg_wal/archive_status:
-rw---. 1 postgres postgres 0 Feb 15 09:39
00010002.0028.backup.done
-rw---. 1 postgres postgres 0 Feb 15 09:39 00010002.done

pg-hdp-node2.kitchen.local
/var/lib/pgsql/10/data/pg_wal/:
-rw---. 1 postgres root 16777216 Feb 15 09:39
00010002
-rw---. 1 postgres postgres 16777216 Feb 15 09:44
00010003
drwx--. 2 postgres root6 Feb 15 09:39 archive_status
/var/lib/pgsql/10/data/pg_wal/archive_status:

diff from secondary pg-hdp-node2.kitchen.local on
/var/lib/pgsql/10/data/pg_wal/00010002 and
/mnt/pgsql/archive/00010002 shows binary differences but as
expected no differences for diff on primary pg-hdp-node1.kitchen.local

Failover is performed and pg-hdp-node2.kitchen.local tries and fails to
archive WAL segment 00010002 because it has been previously
archived
2019-02-15 09:54:50.518 PST [780] DETAIL:  The failed archive command was:
test ! -f /mnt/pgsql/archive/00010002 && cp
pg_wal/00010002 /mnt/pgsql/archive/00010002

Based on this thread
https://www.postgresql.org/message-id/11b405a6-2176-9510-bf5b-ea9c0e860635%40pgmasters.net
it is suggested to handle this case by reporting success but in my case
contents are different. I would think that previously archived
00010002 is the right WAL segment.

So my questions are as follows:

Why WAL segments differ?
How should this be resolved on the new primary?
-- 

*Andre Piwoni*


Re: Promoted slave tries to archive previously archived WAL file

2019-02-15 Thread Andre Piwoni
Ok. I think I uncovered a bug.

My slave nodes were created using pg_basebackup with --wal-method=stream.
If I understand right this option streams WAL files generated during backup
and this WAL file was 00010002 but its contents were
different from what was on the primary and in WAL archive. When I changed
--wal-method=fetch which collects WAL files generated during the backup at
the end then diff did not detect any changes. When I failover I don't seem
to have the issue with new primary to archive.

It seems streaming WAL segments created during backup produced corrupt file
based on diff.





On Fri, Feb 15, 2019 at 10:23 AM Andre Piwoni  wrote:

> I have master and slave running with the following contents of their
> pg_wal directories and archivedir:
>
> ls -l /mnt/pgsql/archive/
> -rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 00010001
> -rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 00010002
> -rw-rw-rw-. 1 root root  302 Feb 15 09:39
> 00010002.0028.backup
>
> pg-hdp-node1.kitchen.local
> /var/lib/pgsql/10/data/pg_wal/:
> -rw---. 1 postgres postgres 16777216 Feb 15 09:39
> 00010002
> -rw---. 1 postgres postgres  302 Feb 15 09:39
> 00010002.0028.backup
> -rw---. 1 postgres postgres 16777216 Feb 15 09:44
> 00010003
> -rw---. 1 postgres postgres 16777216 Feb 15 09:39
> 00010004
> drwx--. 2 postgres postgres   96 Feb 15 09:44 archive_status
> /var/lib/pgsql/10/data/pg_wal/archive_status:
> -rw---. 1 postgres postgres 0 Feb 15 09:39
> 00010002.0028.backup.done
> -rw---. 1 postgres postgres 0 Feb 15 09:39
> 00010002.done
>
> pg-hdp-node2.kitchen.local
> /var/lib/pgsql/10/data/pg_wal/:
> -rw---. 1 postgres root 16777216 Feb 15 09:39
> 00010002
> -rw---. 1 postgres postgres 16777216 Feb 15 09:44
> 00010003
> drwx--. 2 postgres root6 Feb 15 09:39 archive_status
> /var/lib/pgsql/10/data/pg_wal/archive_status:
>
> diff from secondary pg-hdp-node2.kitchen.local on
> /var/lib/pgsql/10/data/pg_wal/00010002 and
> /mnt/pgsql/archive/00010002 shows binary differences but as
> expected no differences for diff on primary pg-hdp-node1.kitchen.local
>
> Failover is performed and pg-hdp-node2.kitchen.local tries and fails to
> archive WAL segment 00010002 because it has been previously
> archived
> 2019-02-15 09:54:50.518 PST [780] DETAIL:  The failed archive command was:
> test ! -f /mnt/pgsql/archive/00010002 && cp
> pg_wal/00010002 /mnt/pgsql/archive/00010002
>
> Based on this thread
> https://www.postgresql.org/message-id/11b405a6-2176-9510-bf5b-ea9c0e860635%40pgmasters.net
> it is suggested to handle this case by reporting success but in my case
> contents are different. I would think that previously archived
> 00010002 is the right WAL segment.
>
> So my questions are as follows:
>
> Why WAL segments differ?
> How should this be resolved on the new primary?
> --
>
> *Andre Piwoni*
>


-- 

*Andre Piwoni*

Sr. Software Developer, BI/Database

*Web*MD Health Services

Mobile: 801.541.4722

www.webmdhealthservices.com


Re: Promoted slave tries to archive previously archived WAL file

2019-02-19 Thread Andre Piwoni
I call pg_ctl -D /var/lib/pgsql/10/data promote to upgrade slave to master
when failover happens
I repoint slave to the master by stopping it, updating recovery.conf and
restarting it. Let me know if I'm doing it wrong.

Thank you for clarifying that contents of WAL files from stream and fetch
method should be different. The fact remains that WAL file created during
the backup by stream method does not have extension .done and new master
repeatedly fails to archive it because previous master archived it. This
does not happen in fetch mode where file created during backup has .done
extension.


On Sun, Feb 17, 2019 at 5:27 PM Michael Paquier  wrote:

> On Fri, Feb 15, 2019 at 12:03:39PM -0800, Andre Piwoni wrote:
> > My slave nodes were created using pg_basebackup with --wal-method=stream.
> > If I understand right this option streams WAL files generated during
> backup
> > and this WAL file was 00010002 but its contents were
> > different from what was on the primary and in WAL archive. When I changed
> > --wal-method=fetch which collects WAL files generated during the backup
> at
> > the end then diff did not detect any changes. When I failover I don't
> seem
> > to have the issue with new primary to archive.
> >
> > It seems streaming WAL segments created during backup produced corrupt
> file
> > based on diff.
>
> Nope, --wal-method=stream and fetch behave as they are designed for.
> "fetch" would include all the contents of pg_wal/ as part of a
> BASE_BACKUP command using the replication protocol while "stream"
> would make the base backup avoid copying all WAL files, copying them
> in parallel with a secondary process which uses the same facility as
> pg_receiverwal, in which case the set of files in
> pg_wal/archive_status/ has little meaning.
>
> I am also curious about the way you use to do failovers.  Let me
> guess: you stop the standby, delete its recovery.conf and then restart
> the former standby?  This would prevent a timeline jump at promotion
> which would explain the conflicts you are seeing when archiving two
> times the same segment.
> --
> Michael
>


-- 

*Andre Piwoni*

Sr. Software Developer, BI/Database

*Web*MD Health Services

Mobile: 801.541.4722

www.webmdhealthservices.com


Re: PostgreSQL (linux) configuration with GSSAPI to a Windows domain

2019-02-28 Thread Andre Piwoni
I think setting up PAM authentication with AD on Linux server joined to
domain via realm SSSD was much easier and transparent.

Something like this worked for me to create SPN mapping and keytab in one
command without need to use UPPERCASE for POSTGRES:
ktpass -out postgres.keytab -princ POSTGRES/ubuntu.ad.corp@ad.corp.com
-mapUser AD\POSTGRES -pass 'thepassword' -mapOp add -crypto ALL -ptype
KRB5_NT_PRINCIPAL

pg_hba.conf
host all all 0.0.0.0/0 gss gss include_realm=0 krb_realm=AD.CORP.COM
ktb_realm should not be needed since you have one in your krb5.conf

postgresql.conf
krb_server_keyfile = '/etc/postgresql/9.6/main/postgres.keytab'
#krb_caseins_users = off

kinit ubunt...@ad.corp.com
psql.exe -h 192.168.1.143 -U ubuntupg

klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: ubunt...@ad.corp.com

Valid starting   Expires  Service principal
08/03/2018 22:28:47  08/04/2018 08:28:47  krbtgt/ad.corp@ad.corp.com
renew until 08/10/2018 22:28:42
08/03/2018 22:29:00  08/04/2018 08:28:47  POSTGRES/
ubuntu.ad.corp@ad.corp.com
renew until 08/10/2018 22:28:42

On Thu, Feb 28, 2019 at 2:54 PM Jean-Philippe Chenel 
wrote:

> I'm trying to configure authentication between PostgreSQL database server
> on linux and Windows Active Directory.
>
> *First part of configuration is working but when I'm trying to
> authenticate from Windows client, it is not working with message: Can't
> obtain database list from the server. SSPI continuation error. The
> specified target is unknown or unreachable (80090303)*
>
> *On Windows:*
>
> Domain is AD.CORP.COM
>
> Host is: WIN.AD.CORP.COM, IP is 192.168.1.173
>
> *On Linux (Ubuntu 16.04)*
>
> hostname is UBUNTU.ad.corp.com, IP is 192.168.1.143
>
> DNS are configured to reach the AD sytem (.173)
>
> PostgreSQL 9.6.9 on x86_64-pc-linux-gnu (Ubuntu 9.6.9-2.pgdg16.04+1),
> compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
>
> I've created à service user called POSTGRES and a normal user in AD called
> ubuntupg.
>
> Finally I've created the SPN:
>
> setspn -A POSTGRES/UBUNTU.ad.corp.com POSTGRES
>
> Generated the keytab to put on the linux server:
>
> ktpass -out postgres.keytab -princ POSTGRES/ubuntu.ad.corp@ad.corp.com 
> -mapUser POSTGRES -pass 'thepassword' -crypto all -ptype KRB5_NT_PRINCIPAL
>
> On the linux /etc/krb5.conf:
>
> [libdefaults]
>   debug=true
>   default_realm = AD.CORP.COM
>   dns_lookup_realm = false
>   dns_lookup_kdc = false
>   ticket_lifetime = 24h
>   renew_lifetime = 7d
>   forwardable = true
>
> [realms]
>   AD.CORP.COM = {
> kdc = WIN.AD.CORP.COM
>   }
>
> [domain_realm]
>   ad.corp.com = AD.CORP.COM
>   .ad.corp.com = AD.CORP.COM
>
> Making this command work and klist return a ticket:
>
> kinit -V -k -t /etc/postgresql/9.6/main/postgres.keytab 
> POSTGRES/ubuntu.ad.corp@ad.corp.com
>
> klist -k /etc/postgresql/9.6/main/postgres.keytab
>
> POSTGRES/ubuntu.ad.corp@ad.corp.com
>
> Here is the added onfiguration to postgresql.conf
>
> krb_server_keyfile = '/etc/postgresql/9.6/main/postgres.keytab'
>
> Here is the configuration of pg_hba.conf
>
> hostall  all0.0.0.0/0 gss
>
> Up to here, all is working as expected, kinit with ubuntupg is also
> working well. ubuntupg and ubunt...@ad.corp.com is also created on the
> database. The probleme is when I try, from a Windows client, connecting to
> the DB.
>
> psql.exe -h 192.168.1.143 -U ubuntupg
>
> *Can't obtain database list from the server. SSPI continuation error. The
> specified target is unknown or unreachable (80090303)*
>
> PostgreSQL log file show:
>
> 2019-02-28 14:02:54.178 EST [6747] [unknown]@[unknown] LOG:  0: 
> connection received: host=192.168.1.176 port=57254
> 2019-02-28 14:02:54.178 EST [6747] [unknown]@[unknown] LOCATION:  
> BackendInitialize, postmaster.c:4188
> 2019-02-28 14:02:54.331 EST [6747] ubuntupg@ubuntupg FATAL:  28000: GSSAPI 
> authentication failed for user "ubuntupg"
> 2019-02-28 14:02:54.331 EST [6747] ubuntupg@ubuntupg DETAIL:  Connection 
> matched pg_hba.conf line 92: "hostall  all
> 0.0.0.0/0 gss"
> 2019-02-28 14:02:54.331 EST [6747] ubuntupg@ubuntupg LOCATION:  auth_failed, 
> auth.c:307
>
> psql.exe -h 192.168.1.143 -U ubunt...@ad.corp.com
>
> 2019-02-28 14:06:35.992 EST [6866] [unknown]@[unknown] LOG:  0: 
> connection received: host=192.168.1.176 port=57282
> 2019-02-28 14:06:35.992 EST [6866] [unknown]@[unknown] LOCATION:  
> BackendInitialize, postmaster.c:4188
> 2019-02-28 14:06:36.148 EST [6866] ubunt...@ad.corp.com@ubunt...@ad.corp.com 
> FATAL:  28000: GSSAPI authentication failed for user "ubunt...@ad.corp.com"
> 2019-02-28 14:06:36.148 EST [6866] ubunt...@ad.corp.com@ubunt...@ad.corp.com 
> DETAIL:  Connection matched pg_hba.conf line 96: "hostall  
> all0.0.0.0/0 gss"
> 2019-02-28 14:06:36.148 EST [6866] ubunt...@ad.corp.com@ubunt...@ad.corp.com 
> LOCATION:  auth_failed, auth.c:307
>
> Thank you v