(Note: I have also attached the whole email for better readability of the logs)

Hello!

Our setup: a 5-node Patroni cluster with PostgreSQL 16.9.
db1: current leader
db2: sync-replica
db3/4/5: replica

The replicas connect to the leader using the leader's host IP, so there are 4 
walsenders for Patroni-managed streaming replication: 1 sync and 3 async.

The Patroni cluster uses a service IP address (VIP). All clients connecting to 
the current leader use the VIP. These clients are:
- some web-apps doing normal DB queries (read/write)
- 2 barman backup clients using streaming replication
- 58 logical replication clients

Additionally, we use https://github.com/EnterpriseDB/pg_failover_slots to sync 
and advance the logical replication slots on the replicas. The pg_failover_slots 
worker periodically connects to the leader using the VIP.
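
Side note: to verify that the slots actually show up and advance on the 
replicas, a plain catalog query is enough, nothing extension-specific, roughly:

  SELECT slot_name, slot_type, active, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
  ORDER BY slot_name;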


We had a planned maintenance window and wanted to switch the leader from db1 to db2:
12:32:18: patronictl switchover --leader db1 --candidate db2

The postmaster received the fast shutdown request from Patroni and started 
terminating the client backend processes:
12:32:42 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:32:42.224 UTC 2748 
LOG:  received fast shutdown request
12:32:42 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:32:42.224 UTC 2748 
LOG:  aborting any active transactions
12:32:42 db1 patroni[1842780]: patroni_postgres: 2025-07-16 12:32:42.224 UTC 
1842780 FATAL:  terminating connection due to administrator command
12:32:42 db1 patroni[1842702]: patroni_postgres: 2025-07-16 12:32:42.224 UTC 
1842702 FATAL:  terminating connection due to administrator command
...
12:32:42 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:32:42.555 UTC 2748 
LOG:  background worker "logical replication launcher" (PID 99854) exited with 
exit code 1
12:32:42 db1 patroni[3935]: patroni_postgres: 2025-07-16 12:32:42.573 UTC 3935 
FATAL:  terminating background worker "pg_failover_slots worker" due to 
administrator command
12:32:42 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:32:42.626 UTC 2748 
LOG:  background worker "pg_failover_slots worker" (PID 3935) exited with exit 
code 1

At that time, all the normal clients had been disconnected, and the logs only 
show new connection attempts being refused, e.g.:
12:32:42 db1 patroni[1846807]: patroni_postgres: 2025-07-16 12:32:42.796 UTC 
1846807 LOG:  connection received: host=xxxx:xxx:8::13 port=53704
12:32:42 db1 patroni[1846807]: patroni_postgres: 2025-07-16 12:32:42.800 UTC 
1846807 FATAL:  the database system is shutting down

12:32:43 db1 patroni[2988]: patroni_postgres: 2025-07-16 12:32:43.259 UTC 2988 
LOG:  shutting down


We may have lost some relevant Patroni or Postgres log lines because journald 
rate-limited them :-(
12:33:39 db1 systemd-journald[1613]: Suppressed 10375 messages from 
patroni.service



Usually a switchover takes only a few seconds. After waiting a few minutes we 
became anxious and started debugging.


Using "ps -Alf|grep postgres" we saw that there were no more normal client 
connections, but still 58 logical replicaton walsender processes and 6 
streaming replication walsenders.
"top" revealed that the walsenders were eating CPU. I know for sure that I saw 
plenty logical replication walsenders that were eating 100% CPU, but I have not 
checked if also the streaming replication walsenders were eating CPU.
I also straced the checkpointer process, and that was doing no I/O, just 
looping or polling (I can't remember the details). Our Performance Monitoring 
showed a general drop of I/O activity during the shutdown.
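
In hindsight we could also have checked, from an already-open superuser session 
(new connections are refused during a fast shutdown), what the walsenders were 
waiting on, with something along these lines (standard pg_stat_activity 
columns; with a real busy loop I would expect wait_event to be NULL most of 
the time):

  SELECT pid, backend_type, state, wait_event_type, wait_event
  FROM pg_stat_activity
  WHERE backend_type = 'walsender';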

So this sounds like the walsenders were stuck in what the Postgres mailing 
lists call a "busy loop".

Then we decided to remove the VIP (which is used by the logical replication 
clients) from db1 to trigger replication errors:
systemctl stop vip-manager.service
12:50:11 db1 commandlog[2276907]: root/root : ip a d x.x.x.x/24 dev bond0.203
12:50:11 db1 commandlog[2276907]: root/root : ip -6 a d xxxx:xxx:x:x::10/64 dev 
bond0.203

Then, either because of that or for some other reason, the connections timed 
out. We are using the default timeout:
wal_sender_timeout: 60s

Strictly speaking, the timeouts started not 60 s after removing the VIP, but 
already after about 45 seconds. I guess the timeout clock had started ticking 
earlier, since the only remaining traffic on the walsender connections was the 
status/keepalive messages from the clients.
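
For completeness, the effective values can be checked via pg_settings; the 
keepalive frequency is controlled on the client side by 
wal_receiver_status_interval (for physical standbys and, as far as I know, 
logical subscribers alike):

  SELECT name, setting, unit, source
  FROM pg_settings
  WHERE name IN ('wal_sender_timeout', 'wal_receiver_status_interval');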

12:50:55 db1 patroni[2586700]: patroni_postgres: 2025-07-16 12:50:55.977 UTC 
2586700 LOG:  terminating walsender process due to replication timeout
12:50:55 db1 patroni[2586700]: patroni_postgres: 2025-07-16 12:50:55.977 UTC 
2586700 STATEMENT:  START_REPLICATION SLOT "barman_sbg" 7DFD/5F000000 TIMELINE 
22
12:50:56 db1 patroni[2579309]: patroni_postgres: 2025-07-16 12:50:56.027 UTC 
2579309 LOG:  terminating walsender process due to replication timeout
12:50:56 db1 patroni[2579309]: patroni_postgres: 2025-07-16 12:50:56.027 UTC 
2579309 STATEMENT:  START_REPLICATION SLOT "reg_tam1" LOGICAL 75D4/F7FDBB8 
(proto_version '4', origin 'any', publication_names '"mypublications"')
12:50:56 db1 patroni[827455]: patroni_postgres: 2025-07-16 12:50:56.212 UTC 
827455 LOG:  terminating walsender process due to replication timeout
12:50:56 db1 patroni[827455]: patroni_postgres: 2025-07-16 12:50:56.212 UTC 
827455 STATEMENT:  START_REPLICATION SLOT "reg_lis1" LOGICAL 7C86/593393A0 
(proto_version '4', origin 'any', publication_names '"mypublications"')
12:50:56 db1 patroni[3390803]: patroni_postgres: 2025-07-16 12:50:56.506 UTC 
3390803 LOG:  terminating walsender process due to replication timeout
12:50:56 db1 patroni[3390803]: patroni_postgres: 2025-07-16 12:50:56.506 UTC 
3390803 STATEMENT:  START_REPLICATION SLOT "reg_rey1" LOGICAL 77A5/205F5F0 
(proto_version '4', origin 'any', publication_names '"mypublications"')
12:50:56 db1 patroni[137758]: patroni_postgres: 2025-07-16 12:50:56.672 UTC 
137758 LOG:  terminating walsender process due to replication timeout
12:50:56 db1 patroni[137758]: patroni_postgres: 2025-07-16 12:50:56.672 UTC 
137758 STATEMENT:  START_REPLICATION SLOT "reg_tlv1" LOGICAL 7DC2/CEAB27A0 
(proto_version '4', origin 'any', publication_names '"mypublications"')
12:50:56 db1 patroni[2586624]: patroni_postgres: 2025-07-16 12:50:56.929 UTC 
2586624 LOG:  terminating walsender process due to replication timeout
12:50:56 db1 patroni[2586624]: patroni_postgres: 2025-07-16 12:50:56.929 UTC 
2586624 STATEMENT:  START_REPLICATION SLOT "barman" 7DFD/60000000 TIMELINE 22

...plus an additional 54 logical walsenders that timed out and were terminated.

12:51:05 db1 patroni[2988]: patroni_postgres: 2025-07-16 12:51:05.902 UTC 2988 
LOG:  checkpoint starting: shutdown immediate
12:51:05 db1 patroni[2988]: patroni_postgres: 2025-07-16 12:51:05.926 UTC 2988 
LOG:  checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 0 
removed, 0 recycled; write=0.021 s, sync=0.001 s, total=0.026 s; sync files=9, 
longest=0.001 s, average=0.001 s; distance=55 kB, estimate=1080354 kB; 
lsn=7E53/18554B28, redo lsn=7E53/18554B28
12:51:07 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:51:07.159 UTC 2748 
LOG:  database system is shut down
12:51:07 db1 patroni[2389]: 2025-07-16 12:51:07,954 INFO: Got response from db2 
https://db2:8008/patroni: {"state": "running", "postmaster_start_time": 
"2025-07-16 12:17:25.758501+00:00", "role": "replica", "server_version": 
160009, "xlog": {"received_location": 138895355628448, "replayed_location": 
138895355628448, "replayed_timestamp": "2025-07-16 12:32:42.142954+00:00", 
"paused": false}, "sync_standby": true, "timeline": 22, "dcs_last_seen": 
1752670265, "database_system_identifier": "7386557262250076198", "patroni": 
{"version": "4.0.5", "scope": "regdns_patroni_cluster_1", "name": "db2"}}


After all walsenders that used the VIP (the 58 logical senders and the 2 
streaming senders to barman) had been closed due to the timeout, the database 
shut down and the switchover completed.

I do not know for sure, but I think the walsenders to the Patroni replicas, 
which use the host IP and not the VIP, were still active when we removed the 
VIP. It seems they shut down cleanly after all the other walsenders had been 
terminated due to the timeout.
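
Next time I will record the walsender inventory before starting the 
switchover, e.g. with something like the following sketch (client_addr shows 
whether a walsender came in via the VIP or via the host IP):

  SELECT pid, application_name, client_addr, state, sync_state, reply_time
  FROM pg_stat_replication
  ORDER BY client_addr;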

So what caused the hang? We found bug 
https://www.postgresql.org/message-id/aD0Xvh4Tj_Vk7pCA%40paquier.xyz, but that 
one seems to affect only logical walsenders on a standby.

We cannot reproduce it - a few minutes later we did another switchover back to 
db1, and this time it completed within a few seconds.

Do you have any ideas what the problem could be, how we could solve it, and 
how to debug it if it happens again? My rough plan for the next occurrence is 
sketched below.
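
Unless you have better suggestions, I would grab backtraces and CPU profiles 
of the stuck walsenders before touching the VIP (the PIDs are placeholders):

  # backtrace of one stuck walsender (repeat for a few PIDs)
  gdb --batch -ex 'bt' -p <walsender_pid>
  # where is the CPU time going?
  perf top -p <walsender_pid>
  # keep a syscall trace for later comparison
  strace -tt -p <walsender_pid> -o walsender.strace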

Thanks
Klaus

--
Klaus Darilion, Head of Operations
nic.at GmbH, Jakob-Haringer-Straße 8/V
5020 Salzburg, Austria
