Correct setup for 3+ node cluster with streaming replication and WAL archiving
hive/00010005 2019-02-14 11:13:19.199 PST [1579] DEBUG: attempting to remove WAL segments older than log file 0006 2019-02-14 11:13:19.200 PST [1579] DEBUG: recycled write-ahead log file "00010006" 2019-02-14 11:13:20.201 PST [2346] LOG: archive command failed with exit code 1 OTHER SLAVE 2019-02-14 11:13:14.723 PST [5807] FATAL: could not connect to the primary server: could not connect to server: Connection refused Is the server running on host "pg-hdp-node1.kitchen.local" (172.28.128.54) and accepting TCP/IP connections on port 5432? 2019-02-14 11:13:16.072 PST [1839] DEBUG: performing replication slot checkpoint 2019-02-14 11:13:16.077 PST [1839] DEBUG: attempting to remove WAL segments older than log file 0005 2019-02-14 11:13:16.078 PST [1839] DEBUG: recycled write-ahead log file "00010005" 2019-02-14 11:13:19.412 PST [1838] DEBUG: switched WAL source from stream to archive after failure 2019-02-14 11:13:19.412 PST [1838] DEBUG: invalid resource manager ID 64 at 0/798 2019-02-14 11:13:19.412 PST [1838] DEBUG: switched WAL source from archive to stream after failure 2019-02-14 11:13:24.418 PST [1839] LOG: shutting down 2019-02-14 11:13:24.427 PST [1834] LOG: database system is shut down 2019-02-14 11:13:25.096 PST [5839] LOG: database system was shut down in recovery at 2019-02-14 11:13:24 PST 2019-02-14 11:13:25.096 PST [5839] DEBUG: standby_mode = 'on' 2019-02-14 11:13:25.096 PST [5839] DEBUG: primary_conninfo = 'user=replication password=password host=pg-hdp-node3.kitchen.local port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any' 2019-02-14 11:13:25.096 PST [5839] DEBUG: restore_command = 'cp /mnt/pgsql/archive/%f %p' 2019-02-14 11:13:25.096 PST [5839] LOG: entering standby mode cp: cannot stat ‘/mnt/pgsql/archive/00010007’: No such file or directory 2019-02-14 11:13:25.102 PST [5839] DEBUG: could not restore file "00010007" from archive: child process exited with exit code 1 2019-02-14 11:13:25.149 PST [5839] DEBUG: switched WAL source from stream to archive after failure cp: cannot stat ‘/mnt/pgsql/archive/00010007’: No such file or directory 2019-02-14 11:13:25.154 PST [5839] DEBUG: could not restore file "00010007" from archive: child process exited with exit code 1 2019-02-14 11:13:25.154 PST [5839] DEBUG: invalid resource manager ID 64 at 0/798 *Andre Piwoni*
Promoted slave tries to archive previously archived WAL file
I have master and slave running with the following contents of their pg_wal directories and archivedir: ls -l /mnt/pgsql/archive/ -rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 00010001 -rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 00010002 -rw-rw-rw-. 1 root root 302 Feb 15 09:39 00010002.0028.backup pg-hdp-node1.kitchen.local /var/lib/pgsql/10/data/pg_wal/: -rw---. 1 postgres postgres 16777216 Feb 15 09:39 00010002 -rw---. 1 postgres postgres 302 Feb 15 09:39 00010002.0028.backup -rw---. 1 postgres postgres 16777216 Feb 15 09:44 00010003 -rw---. 1 postgres postgres 16777216 Feb 15 09:39 00010004 drwx--. 2 postgres postgres 96 Feb 15 09:44 archive_status /var/lib/pgsql/10/data/pg_wal/archive_status: -rw---. 1 postgres postgres 0 Feb 15 09:39 00010002.0028.backup.done -rw---. 1 postgres postgres 0 Feb 15 09:39 00010002.done pg-hdp-node2.kitchen.local /var/lib/pgsql/10/data/pg_wal/: -rw---. 1 postgres root 16777216 Feb 15 09:39 00010002 -rw---. 1 postgres postgres 16777216 Feb 15 09:44 00010003 drwx--. 2 postgres root6 Feb 15 09:39 archive_status /var/lib/pgsql/10/data/pg_wal/archive_status: diff from secondary pg-hdp-node2.kitchen.local on /var/lib/pgsql/10/data/pg_wal/00010002 and /mnt/pgsql/archive/00010002 shows binary differences but as expected no differences for diff on primary pg-hdp-node1.kitchen.local Failover is performed and pg-hdp-node2.kitchen.local tries and fails to archive WAL segment 00010002 because it has been previously archived 2019-02-15 09:54:50.518 PST [780] DETAIL: The failed archive command was: test ! -f /mnt/pgsql/archive/00010002 && cp pg_wal/00010002 /mnt/pgsql/archive/00010002 Based on this thread https://www.postgresql.org/message-id/11b405a6-2176-9510-bf5b-ea9c0e860635%40pgmasters.net it is suggested to handle this case by reporting success but in my case contents are different. I would think that previously archived 00010002 is the right WAL segment. So my questions are as follows: Why WAL segments differ? How should this be resolved on the new primary? -- *Andre Piwoni*
Re: Promoted slave tries to archive previously archived WAL file
Ok. I think I uncovered a bug. My slave nodes were created using pg_basebackup with --wal-method=stream. If I understand right this option streams WAL files generated during backup and this WAL file was 00010002 but its contents were different from what was on the primary and in WAL archive. When I changed --wal-method=fetch which collects WAL files generated during the backup at the end then diff did not detect any changes. When I failover I don't seem to have the issue with new primary to archive. It seems streaming WAL segments created during backup produced corrupt file based on diff. On Fri, Feb 15, 2019 at 10:23 AM Andre Piwoni wrote: > I have master and slave running with the following contents of their > pg_wal directories and archivedir: > > ls -l /mnt/pgsql/archive/ > -rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 00010001 > -rw-rw-rw-. 1 root root 16777216 Feb 15 09:39 00010002 > -rw-rw-rw-. 1 root root 302 Feb 15 09:39 > 00010002.0028.backup > > pg-hdp-node1.kitchen.local > /var/lib/pgsql/10/data/pg_wal/: > -rw---. 1 postgres postgres 16777216 Feb 15 09:39 > 00010002 > -rw---. 1 postgres postgres 302 Feb 15 09:39 > 00010002.0028.backup > -rw---. 1 postgres postgres 16777216 Feb 15 09:44 > 00010003 > -rw---. 1 postgres postgres 16777216 Feb 15 09:39 > 00010004 > drwx--. 2 postgres postgres 96 Feb 15 09:44 archive_status > /var/lib/pgsql/10/data/pg_wal/archive_status: > -rw---. 1 postgres postgres 0 Feb 15 09:39 > 00010002.0028.backup.done > -rw---. 1 postgres postgres 0 Feb 15 09:39 > 00010002.done > > pg-hdp-node2.kitchen.local > /var/lib/pgsql/10/data/pg_wal/: > -rw---. 1 postgres root 16777216 Feb 15 09:39 > 00010002 > -rw---. 1 postgres postgres 16777216 Feb 15 09:44 > 00010003 > drwx--. 2 postgres root6 Feb 15 09:39 archive_status > /var/lib/pgsql/10/data/pg_wal/archive_status: > > diff from secondary pg-hdp-node2.kitchen.local on > /var/lib/pgsql/10/data/pg_wal/00010002 and > /mnt/pgsql/archive/00010002 shows binary differences but as > expected no differences for diff on primary pg-hdp-node1.kitchen.local > > Failover is performed and pg-hdp-node2.kitchen.local tries and fails to > archive WAL segment 00010002 because it has been previously > archived > 2019-02-15 09:54:50.518 PST [780] DETAIL: The failed archive command was: > test ! -f /mnt/pgsql/archive/00010002 && cp > pg_wal/00010002 /mnt/pgsql/archive/00010002 > > Based on this thread > https://www.postgresql.org/message-id/11b405a6-2176-9510-bf5b-ea9c0e860635%40pgmasters.net > it is suggested to handle this case by reporting success but in my case > contents are different. I would think that previously archived > 00010002 is the right WAL segment. > > So my questions are as follows: > > Why WAL segments differ? > How should this be resolved on the new primary? > -- > > *Andre Piwoni* > -- *Andre Piwoni* Sr. Software Developer, BI/Database *Web*MD Health Services Mobile: 801.541.4722 www.webmdhealthservices.com
Re: Promoted slave tries to archive previously archived WAL file
I call pg_ctl -D /var/lib/pgsql/10/data promote to upgrade slave to master when failover happens I repoint slave to the master by stopping it, updating recovery.conf and restarting it. Let me know if I'm doing it wrong. Thank you for clarifying that contents of WAL files from stream and fetch method should be different. The fact remains that WAL file created during the backup by stream method does not have extension .done and new master repeatedly fails to archive it because previous master archived it. This does not happen in fetch mode where file created during backup has .done extension. On Sun, Feb 17, 2019 at 5:27 PM Michael Paquier wrote: > On Fri, Feb 15, 2019 at 12:03:39PM -0800, Andre Piwoni wrote: > > My slave nodes were created using pg_basebackup with --wal-method=stream. > > If I understand right this option streams WAL files generated during > backup > > and this WAL file was 00010002 but its contents were > > different from what was on the primary and in WAL archive. When I changed > > --wal-method=fetch which collects WAL files generated during the backup > at > > the end then diff did not detect any changes. When I failover I don't > seem > > to have the issue with new primary to archive. > > > > It seems streaming WAL segments created during backup produced corrupt > file > > based on diff. > > Nope, --wal-method=stream and fetch behave as they are designed for. > "fetch" would include all the contents of pg_wal/ as part of a > BASE_BACKUP command using the replication protocol while "stream" > would make the base backup avoid copying all WAL files, copying them > in parallel with a secondary process which uses the same facility as > pg_receiverwal, in which case the set of files in > pg_wal/archive_status/ has little meaning. > > I am also curious about the way you use to do failovers. Let me > guess: you stop the standby, delete its recovery.conf and then restart > the former standby? This would prevent a timeline jump at promotion > which would explain the conflicts you are seeing when archiving two > times the same segment. > -- > Michael > -- *Andre Piwoni* Sr. Software Developer, BI/Database *Web*MD Health Services Mobile: 801.541.4722 www.webmdhealthservices.com
Re: PostgreSQL (linux) configuration with GSSAPI to a Windows domain
I think setting up PAM authentication with AD on Linux server joined to domain via realm SSSD was much easier and transparent. Something like this worked for me to create SPN mapping and keytab in one command without need to use UPPERCASE for POSTGRES: ktpass -out postgres.keytab -princ POSTGRES/ubuntu.ad.corp@ad.corp.com -mapUser AD\POSTGRES -pass 'thepassword' -mapOp add -crypto ALL -ptype KRB5_NT_PRINCIPAL pg_hba.conf host all all 0.0.0.0/0 gss gss include_realm=0 krb_realm=AD.CORP.COM ktb_realm should not be needed since you have one in your krb5.conf postgresql.conf krb_server_keyfile = '/etc/postgresql/9.6/main/postgres.keytab' #krb_caseins_users = off kinit ubunt...@ad.corp.com psql.exe -h 192.168.1.143 -U ubuntupg klist Ticket cache: FILE:/tmp/krb5cc_0 Default principal: ubunt...@ad.corp.com Valid starting Expires Service principal 08/03/2018 22:28:47 08/04/2018 08:28:47 krbtgt/ad.corp@ad.corp.com renew until 08/10/2018 22:28:42 08/03/2018 22:29:00 08/04/2018 08:28:47 POSTGRES/ ubuntu.ad.corp@ad.corp.com renew until 08/10/2018 22:28:42 On Thu, Feb 28, 2019 at 2:54 PM Jean-Philippe Chenel wrote: > I'm trying to configure authentication between PostgreSQL database server > on linux and Windows Active Directory. > > *First part of configuration is working but when I'm trying to > authenticate from Windows client, it is not working with message: Can't > obtain database list from the server. SSPI continuation error. The > specified target is unknown or unreachable (80090303)* > > *On Windows:* > > Domain is AD.CORP.COM > > Host is: WIN.AD.CORP.COM, IP is 192.168.1.173 > > *On Linux (Ubuntu 16.04)* > > hostname is UBUNTU.ad.corp.com, IP is 192.168.1.143 > > DNS are configured to reach the AD sytem (.173) > > PostgreSQL 9.6.9 on x86_64-pc-linux-gnu (Ubuntu 9.6.9-2.pgdg16.04+1), > compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit > > I've created à service user called POSTGRES and a normal user in AD called > ubuntupg. > > Finally I've created the SPN: > > setspn -A POSTGRES/UBUNTU.ad.corp.com POSTGRES > > Generated the keytab to put on the linux server: > > ktpass -out postgres.keytab -princ POSTGRES/ubuntu.ad.corp@ad.corp.com > -mapUser POSTGRES -pass 'thepassword' -crypto all -ptype KRB5_NT_PRINCIPAL > > On the linux /etc/krb5.conf: > > [libdefaults] > debug=true > default_realm = AD.CORP.COM > dns_lookup_realm = false > dns_lookup_kdc = false > ticket_lifetime = 24h > renew_lifetime = 7d > forwardable = true > > [realms] > AD.CORP.COM = { > kdc = WIN.AD.CORP.COM > } > > [domain_realm] > ad.corp.com = AD.CORP.COM > .ad.corp.com = AD.CORP.COM > > Making this command work and klist return a ticket: > > kinit -V -k -t /etc/postgresql/9.6/main/postgres.keytab > POSTGRES/ubuntu.ad.corp@ad.corp.com > > klist -k /etc/postgresql/9.6/main/postgres.keytab > > POSTGRES/ubuntu.ad.corp@ad.corp.com > > Here is the added onfiguration to postgresql.conf > > krb_server_keyfile = '/etc/postgresql/9.6/main/postgres.keytab' > > Here is the configuration of pg_hba.conf > > hostall all0.0.0.0/0 gss > > Up to here, all is working as expected, kinit with ubuntupg is also > working well. ubuntupg and ubunt...@ad.corp.com is also created on the > database. The probleme is when I try, from a Windows client, connecting to > the DB. > > psql.exe -h 192.168.1.143 -U ubuntupg > > *Can't obtain database list from the server. SSPI continuation error. The > specified target is unknown or unreachable (80090303)* > > PostgreSQL log file show: > > 2019-02-28 14:02:54.178 EST [6747] [unknown]@[unknown] LOG: 0: > connection received: host=192.168.1.176 port=57254 > 2019-02-28 14:02:54.178 EST [6747] [unknown]@[unknown] LOCATION: > BackendInitialize, postmaster.c:4188 > 2019-02-28 14:02:54.331 EST [6747] ubuntupg@ubuntupg FATAL: 28000: GSSAPI > authentication failed for user "ubuntupg" > 2019-02-28 14:02:54.331 EST [6747] ubuntupg@ubuntupg DETAIL: Connection > matched pg_hba.conf line 92: "hostall all > 0.0.0.0/0 gss" > 2019-02-28 14:02:54.331 EST [6747] ubuntupg@ubuntupg LOCATION: auth_failed, > auth.c:307 > > psql.exe -h 192.168.1.143 -U ubunt...@ad.corp.com > > 2019-02-28 14:06:35.992 EST [6866] [unknown]@[unknown] LOG: 0: > connection received: host=192.168.1.176 port=57282 > 2019-02-28 14:06:35.992 EST [6866] [unknown]@[unknown] LOCATION: > BackendInitialize, postmaster.c:4188 > 2019-02-28 14:06:36.148 EST [6866] ubunt...@ad.corp.com@ubunt...@ad.corp.com > FATAL: 28000: GSSAPI authentication failed for user "ubunt...@ad.corp.com" > 2019-02-28 14:06:36.148 EST [6866] ubunt...@ad.corp.com@ubunt...@ad.corp.com > DETAIL: Connection matched pg_hba.conf line 96: "hostall > all0.0.0.0/0 gss" > 2019-02-28 14:06:36.148 EST [6866] ubunt...@ad.corp.com@ubunt...@ad.corp.com > LOCATION: auth_failed, auth.c:307 > > Thank you v