very high replay_lag on 3-node cluster
Hello,

In my previous post[1] on this list I brought up an issue with long-running checkpoints. I reduced checkpoint_timeout to a more reasonable value (15m, down from 60m) and forced/immediate checkpoints now mostly complete in under a minute. That thread and another one[2] on the Clusterlabs mailing list also helped me understand more about how PostgreSQL internals work, thanks everyone!

Now to my current issue: I took the advice to add more monitoring on replay lag (using pg_last_xact_replay_timestamp) and things are not looking good. Last night replication lagged by almost 6 hours on one of the nodes[3], but eventually caught up. As you can see in that screenshot, ph-sql-03 is consistently slower to replay than ph-sql-05 (ph-sql-04 is the current master), and ph-sql-03 happens to have different SSDs (Crucial MX300 vs Crucial MX500 in the other two), which makes me think this is IO related.

When I check the replay_lag column of pg_stat_replication, the numbers are consistent with the data from pg_last_xact_replay_timestamp:

postgres=# SELECT application_name, replay_lag FROM pg_stat_replication;
 application_name |   replay_lag
------------------+-----------------
 ph-sql-03        | 00:15:16.179952
 ph-sql-05        | 00:10:01.078163

Currently this doesn't present an operational issue, as the slaves aren't used by applications (still waiting for development to make the necessary changes), so there are no queries running on the slaves at all apart from the occasional monitoring.

Cluster specifications:

all nodes:
- filesystem: ZFS stripe of mirrors
- 2x CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores total)
- 128 GB RAM

ph-sql-03: 8x Crucial MX300 1050GB, underprovisioned to 1TB
ph-sql-0{4,5}: 8x Crucial MX500 1TB

postgresql.conf with GUCs changed from default:

shared_buffers = 8GB
work_mem = 64MB
maintenance_work_mem = 2GB
autovacuum_work_mem = 1GB
effective_io_concurrency = 200
max_worker_processes = 50
max_parallel_maintenance_workers = 8
max_parallel_workers_per_gather = 8
max_parallel_workers = 40
wal_level = replica
synchronous_commit = off
full_page_writes = off
wal_log_hints = on
wal_buffers = 128MB
checkpoint_timeout = 15min
max_wal_size = 8GB
min_wal_size = 1GB
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'pgbackrest --stanza=pgdb2 archive-push %p'
max_wal_senders = 10
wal_keep_segments = 20
hot_standby = on
hot_standby_feedback = on
random_page_cost = 1.5
effective_cache_size = 48GB
default_statistics_target = 500
shared_preload_libraries = 'timescaledb, pg_cron, pg_prewarm'  # (change requires restart)
max_locks_per_transaction = 512

What are possible reasons for the high replay_lag? Is my storage just too slow? Are there any tunables available?

[1] https://www.postgresql.org/message-id/flat/CAEkBuzeno6ztiM1g4WdzKRJFgL8b2nfePNU%3Dq3sBiEZUm-D-sQ%40mail.gmail.com
[2] https://lists.clusterlabs.org/pipermail/users/2019-July/025967.html
[3] https://ibb.co/0sncjBZ

Thank you,

Tiemen Ruiten
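P.S. For completeness, the replay-lag monitoring item boils down to a query along these lines on each standby (a rough sketch; the exact wrapper script and alert thresholds live in the monitoring system, and note the value keeps growing while the primary is idle):

# Time since the last transaction was replayed on this standby
psql -At -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;"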
Re: very high replay_lag on 3-node cluster
Anyone have an idea? Thanks very much in advance for any reply.

On Fri, Jul 19, 2019 at 1:46 PM Tiemen Ruiten wrote:
> [...]
Re: very high replay_lag on 3-node cluster
On Mon, Jul 22, 2019 at 11:28 AM Jehan-Guillaume (ioguix) de Rorthais <iog...@free.fr> wrote:
> Hi,
>
> On Mon, 22 Jul 2019 11:05:57 +0200 Tiemen Ruiten wrote:
> [...]
>
> Such a difference is quite surprising. Moreover, I suppose you have some
> caching in front of disks (either RAID or SAN?). Could you describe your
> disk stack with more details?
>
> Do you have any detailed metrics about disks and network IO to share?
>
> The network is the same for both nodes?

Hi! All nodes have only local storage. The filesystem is ZFS, which has its own built-in caching (ARC/L2ARC & ZIL). There are no hardware caching devices present. The disks are connected through a Supermicro AOC-S3008L-L8E HBA (shown by lspci as LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)).

This is the layout of the pool:

[root@ph-sql-04 ter]# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:07:51 with 0 errors on Tue Jul 16 03:17:52 2019
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0

I've attached a graph of network IO on all servers. The network configuration is identical on all three nodes: 2x bonded gigabit connections to the same stacked switch pair.

Currently I don't have many metrics on raw disk IO. I can add some monitoring items and report on those tomorrow if that would be useful. I do have a lot of metrics available related to PostgreSQL (buffers written/read, checkpoints, rows fetched/updated/inserted, etc.).
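In the meantime I can already grab point-in-time samples while replay is lagging, roughly like this (assuming 5-second intervals are fine; iostat comes from the sysstat package):

# Per-vdev ZFS I/O statistics, refreshed every 5 seconds
zpool iostat -v tank 5

# Per-device latency and utilisation
iostat -x 5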
Re: self-made certs not quite right
Hello Rob,

> With root.crt in ~/.postgresql, testing the jdbc connection from a
> tomcat server generates this failure (again either localhost or
> $(hostname))
>
> Blow out on db connection to jdbc:postgresql://localhost:5432/postgres;
> SSL error: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to
> find valid certification path to requested target
> org.postgresql.util.PSQLException: SSL error: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to
> find valid certification path to requested target

Java has its own certificate keystore; you would need to add your certificate to it: https://tomcat.apache.org/tomcat-8.5-doc/ssl-howto.html

Hope this helps.

--
Tiemen Ruiten
Infrastructure Engineer
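P.S. A minimal sketch of what that import could look like with keytool, assuming you want the JVM-wide trust store with its default 'changeit' password (the cacerts path varies by JVM version and packaging, so adjust it for your install):

# Import the self-signed root certificate into the JVM trust store
keytool -importcert -alias pg-root-ca \
    -keystore "$JAVA_HOME/lib/security/cacerts" \
    -storepass changeit \
    -file root.crt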
checkpoints taking much longer than expected
Hello,

I set up a new 3-node cluster with the following specifications:

2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores total)
128 GB RAM
8x Crucial MX500 1TB SSDs

The FS is ZFS; the dataset with the PGDATA directory on it has the following properties (only non-default listed):

NAME        PROPERTY       VALUE  SOURCE
tank/pgsql  used           234G   -
tank/pgsql  available      3.28T  -
tank/pgsql  referenced     234G   -
tank/pgsql  compressratio  2.68x  -
tank/pgsql  compression    lz4    inherited from tank
tank/pgsql  atime          off    inherited from tank
tank/pgsql  canmount       on     local
tank/pgsql  xattr          sa     inherited from tank

My postgresql.conf (only changed from default listed):

hba_file = '/var/lib/pgsql/pg_hba.conf'
listen_addresses = '*'
max_connections = 800
shared_buffers = 8GB
work_mem = 64MB
maintenance_work_mem = 2GB
autovacuum_work_mem = 1GB
dynamic_shared_memory_type = posix
effective_io_concurrency = 200
max_worker_processes = 50
max_parallel_maintenance_workers = 8
max_parallel_workers_per_gather = 8
max_parallel_workers = 40
wal_level = replica
synchronous_commit = off
full_page_writes = on
wal_log_hints = on
wal_buffers = 128MB
checkpoint_timeout = 60min
max_wal_size = 8GB
min_wal_size = 1GB
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'pgbackrest --stanza=pgdb2 archive-push %p'
max_wal_senders = 10
wal_keep_segments = 20
hot_standby = on
hot_standby_feedback = on
random_page_cost = 1.5
effective_cache_size = 48GB
default_statistics_target = 500  # range 1-10000
idle_in_transaction_session_timeout = 30min  # in milliseconds, 0 is disabled
shared_preload_libraries = 'timescaledb, pg_cron'
max_locks_per_transaction = 512
timescaledb.max_background_workers = 8

My problem is that checkpoints are taking a long time. Even when I run a few manual checkpoints one after the other, they keep taking very long, up to 10 minutes:

2019-06-14 15:21:10.351 CEST [23657] LOG:  checkpoint starting: immediate force wait
2019-06-14 15:25:57.655 CEST [23657] LOG:  checkpoint complete: wrote 139831 buffers (13.3%); 0 WAL file(s) added, 148 removed, 40 recycled; write=284.751 s, sync=0.532 s, total=287.304 s; sync files=537, longest=0.010 s, average=0.000 s; distance=2745065 kB, estimate=2841407 kB
2019-06-14 15:26:01.988 CEST [23657] LOG:  checkpoint starting: immediate force wait
2019-06-14 15:30:30.430 CEST [23657] LOG:  checkpoint complete: wrote 238322 buffers (22.7%); 0 WAL file(s) added, 0 removed, 172 recycled; write=264.794 s, sync=0.415 s, total=268.441 s; sync files=378, longest=0.011 s, average=0.001 s; distance=2834018 kB, estimate=2840668 kB
2019-06-14 15:30:44.097 CEST [23657] LOG:  checkpoint starting: immediate force wait
2019-06-14 15:37:01.438 CEST [23657] LOG:  checkpoint complete: wrote 132286 buffers (12.6%); 0 WAL file(s) added, 54 removed, 96 recycled; write=366.614 s, sync=2.975 s, total=377.341 s; sync files=467, longest=0.095 s, average=0.006 s; distance=2444291 kB, estimate=2801030 kB

What is going on? This doesn't seem like normal behaviour.
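The checkpoint lines above come from log_checkpoints = on; the same numbers can be cross-checked against the cumulative counters in pg_stat_bgwriter, roughly like this (a sketch, run on the primary):

# Timed vs. requested checkpoints, and time spent writing/syncing buffers (ms)
psql -c "SELECT checkpoints_timed, checkpoints_req,
                checkpoint_write_time, checkpoint_sync_time,
                buffers_checkpoint
         FROM pg_stat_bgwriter;"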
Re: checkpoints taking much longer than expected
On Fri, Jun 14, 2019 at 5:43 PM Stephen Frost wrote:
> Greetings,
>
> * Tiemen Ruiten (t.rui...@tech-lab.io) wrote:
> > checkpoint_timeout = 60min
>
> That seems like a pretty long timeout.

My reasoning was that a longer recovery time would be an acceptable trade-off for fewer checkpoint writes, because there are two more nodes in the cluster to fall back on in case of emergency.

> > My problem is that checkpoints are taking a long time. Even when I run a
> > few manual checkpoints one after the other, they keep taking very long, up
> > to 10 minutes:
>
> You haven't said *why* this is an issue... Why are you concerned with
> how long it takes to do a checkpoint?

During normal operation I don't mind that it takes a long time, but when performing maintenance I want to be able to gracefully bring down the master without long delays before promoting one of the standbys.

> The time information is all there and it tells you what it's doing and
> how much had to be done... If you're unhappy with how long it takes to
> write out gigabytes of data and fsync hundreds of files, talk to your
> storage people...

I am the storage people too :)

> Thanks,
>
> Stephen
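One mitigation I'm considering for planned maintenance is issuing a manual CHECKPOINT right before stopping the primary, so the shutdown checkpoint has very little left to write. A rough sketch (the data directory path is just a placeholder):

# Flush dirty buffers ahead of time so the shutdown checkpoint stays short
psql -c "CHECKPOINT;"

# Then stop the primary; 'fast' disconnects clients and shuts down cleanly
pg_ctl stop -m fast -D /path/to/pgdata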
Re: checkpoints taking much longer than expected
On Sun, Jun 16, 2019 at 8:57 PM Alvaro Herrera wrote:
> Note that Joyent ended up proposing patches to fix their performance
> problem (and got them committed). Maybe it would be useful for Tiemen
> to try that code? (That commit cherry-picks cleanly on REL_11_STABLE.)

Interesting! The performance improvements in that thread look like they're worth trying. Is this patch likely to make it into an 11.x release?
Re: checkpoints taking much longer than expected
On Sun, Jun 16, 2019 at 7:30 PM Stephen Frost wrote:
> Ok, so you want fewer checkpoints because you expect to failover to a
> replica rather than recover the primary on a failure. If you're doing
> synchronous replication, then that certainly makes sense. If you
> aren't, then you're deciding that you're alright with losing some number
> of writes by failing over rather than recovering the primary, which can
> also be acceptable but it's certainly much more questionable.

Yes, in our setup that's the case: a few lost transactions will have a negligible impact on the business.

> I'm getting the feeling that your replicas are async, but it sounds like
> you'd be better off with having at least one sync replica, so that you
> can flip to it quickly.

They are indeed async; we traded durability for performance here, because we can accept some lost transactions.

> Alternatively, having a way to more easily make
> the primary stop accepting new writes, flush everything to the replicas,
> report that it's completed doing so, to allow you to promote a replica
> without losing anything, and *then* go through the process on the
> primary of doing a checkpoint, would be kind of nice.

I suppose that would require being able to demote a master to a slave at runtime. That would definitely be nice to have.

> Thanks,
>
> Stephen
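If we do end up wanting at least one synchronous replica for this, I believe the primary-side settings would look roughly like the following (a sketch only; it assumes the standbys keep their current application_names, and synchronous_commit, which is off today, would need revisiting):

# postgresql.conf on the primary: wait for any one of the two standbys
synchronous_commit = on
synchronous_standby_names = 'ANY 1 ("ph-sql-03", "ph-sql-05")'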
Re: checkpoints taking much longer than expected
On Sun, Jun 16, 2019 at 8:57 PM Alvaro Herrera wrote:
> On 2019-Jun-14, Peter J. Holzer wrote:
>
> > There was a discussion about ZFS' COW behaviour and PostgreSQL reusing
> > WAL files not being a good combination about a year ago:
> > https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com
> >
> > Maybe you have the same problem?
>
> Note that Joyent ended up proposing patches to fix their performance
> problem (and got them committed). Maybe it would be useful for Tiemen
> to try that code? (That commit cherry-picks cleanly on REL_11_STABLE.)

Unfortunately it doesn't compile:

access/transam/xlog.o: In function `XLogFileInit':
/home/ter/rpmbuild/BUILD/postgresql-11.3/src/backend/access/transam/xlog.c:3299: undefined reference to `pg_pwrite'
collect2: error: ld returned 1 exit status

> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
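In case it helps anyone following along: that linker error suggests the cherry-picked commit calls pg_pwrite(), a port wrapper that 11.x doesn't ship, so the branch would presumably also need the commit that introduced that wrapper. A quick, rough way to check whether the source tree you're building provides it:

# Does this tree carry the pg_pwrite port wrapper at all?
grep -rn "pg_pwrite" src/include/port.h src/port/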
Re: Please help: pgAdmin 4 on Amazon Linux 2
Hi,

Amazon Linux is not compatible with RHEL the way CentOS is, and it's not a supported platform for the RPM installation: https://www.pgadmin.org/download/pgadmin-4-rpm/

The repo configuration probably uses some variables to build the download URLs, and because the release versions don't match, it results in a 404. You may have some luck downloading and installing the RPM directly, but if you want to install pgAdmin on Amazon Linux you'll likely want to look at the other installation methods on the page I linked.

On Fri, Aug 27, 2021 at 5:34 PM Blake McBride wrote:
> Greetings,
>
> I am trying to install pgAdmin 4 on Amazon Linux 2. PostgreSQL is already
> installed and working fine. I believe Amazon Linux 2 is based on RedHat.
>
> I am doing the following:
>
> [root@a-1lxumlkkw4mu4 ~]# rpm -i https://ftp.postgresql.org/pub/pgadmin/pgadmin4/yum/pgadmin4-redhat-repo-2-1.noarch.rpm
> warning: /var/tmp/rpm-tmp.ZEygli: Header V3 RSA/SHA256 Signature, key ID 210976f2: NOKEY
> [root@a-1lxumlkkw4mu4 ~]# yum install pgadmin4
> Loaded plugins: amzn_workspaces_filter_updates, halt_os_update_check, priorities, update-motd
> amzn2-core                    | 3.7 kB  00:00:00
> amzn2extra-GraphicsMagick1.3  | 3.0 kB  00:00:00
> amzn2extra-docker             | 3.0 kB  00:00:00
> amzn2extra-epel               | 3.0 kB  00:00:00
> amzn2extra-gimp               | 1.3 kB  00:00:00
> amzn2extra-libreoffice        | 3.0 kB  00:00:00
> amzn2extra-mate-desktop1.x    | 3.0 kB  00:00:00
> epel/x86_64/metalink          |  15 kB  00:00:00
> firefox                       | 2.2 kB  00:00:00
> google-chrome                 | 1.3 kB  00:00:00
> https://ftp.postgresql.org/pub/pgadmin/pgadmin4/yum/redhat/rhel-2-x86_64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found
> Trying other mirror.
>
>  One of the configured repositories failed (pgadmin4),
>  and yum doesn't have enough cached data to continue. At this point the only
>  safe thing yum can do is fail. There are a few ways to work "fix" this:
>
>      1. Contact the upstream for the repository and get them to fix the problem.
>
>      2. Reconfigure the baseurl/etc. for the repository, to point to a working
>         upstream. This is most often useful if you are using a newer
>         distribution release than is supported by the repository (and the
>         packages for the previous distribution release still work).
>
>      3. Run the command with the repository temporarily disabled
>             yum --disablerepo=pgAdmin4 ...
>
>      4. Disable the repository permanently, so yum won't use it by default. Yum
>         will then just ignore the repository until you permanently enable it
>         again or use --enablerepo for temporary usage:
>
>             yum-config-manager --disable pgAdmin4
>         or
>             subscription-manager repos --disable=pgAdmin4
>
>      5. Configure the failing repository to be skipped, if it is unavailable.
>         Note that yum will try to contact the repo. when it runs most commands,
>         so will have to try and fail each time (and thus. yum will be be much
>         slower). If it is a very temporary problem though, this is often a nice
>         compromise:
>
>             yum-config-manager --save --setopt=pgAdmin4.skip_if_unavailable=true
>
> failure: repodata/repomd.xml from pgAdmin4: [Errno 256] No more mirrors to try.
> https://ftp.postgresql.org/pub/pgadmin/pgadmin4/yum/redhat/rhel-2-x86_64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found
> [root@a-1lxumlkkw4mu4 ~]#
>
> I have no idea how to fix this. Any help would sure be appreciated.
>
> Blake McBride

--
Tiemen Ruiten
Infrastructure Engineer
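P.S. If you do want pgAdmin on Amazon Linux anyway, installing the Python package into a virtualenv is probably the path of least resistance. A rough, untested sketch (the PyPI package is named pgadmin4; the venv path is just an example):

# Install pgAdmin 4 from PyPI instead of from the RHEL repo
python3 -m venv /opt/pgadmin4-venv
/opt/pgadmin4-venv/bin/pip install --upgrade pip wheel
/opt/pgadmin4-venv/bin/pip install pgadmin4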