On Tue, Mar 24, 2026 at 3:00 PM Fujii Masao <[email protected]> wrote:
>
> On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond <[email protected]> wrote:
> > Hi Fujii-san,
> >
> > I tried reproducing the wait scenario as you mentioned, but could not
> > reproduce it.
> > Steps I followed:
> > 1) Place a debugger in the slotsync worker and hold it at
> > fetch_remote_slots() ... -> libpqsrv_get_result()
> > 2) Kill the primary.
> > 3) Trigger promotion of the standby and release the debugger from the
> > slotsync worker.
> >
> > The slot sync worker stops when the promotion is triggered and then
> > restarts, but fails to connect to the primary. The promotion happens
> > immediately.
> > ```
> > LOG:  received promote request
> > LOG:  redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
> > 0.02 s, elapsed: 4574.89 s
> > LOG:  last completed transaction was at log time 2026-03-23
> > 17:13:15.782313+05:30
> > LOG:  replication slot synchronization worker will stop because
> > promotion is triggered
> > LOG:  slot sync worker started
> > ERROR:  synchronization worker "slotsync worker" could not connect to
> > the primary server: connection to server at "127.0.0.1", port 9933
> > failed: Connection refused
> > Is the server running on that host and accepting TCP/IP connections?
> > ```
> >
> > I’ll debug this further to understand it better.
> > In the meantime, please let me know if I’m missing any step, or if you
> > followed a specific setup/script to reproduce this scenario.
>
> Thanks for testing!
>
> If you killed the primary with a signal like SIGTERM, an RST packet might have
> been sent to the slotsync worker at that moment. That allowed the worker to
> detect the connection loss and exit the wait state, so promotion could
> complete as expected.
>
> To reproduce the issue, you'll need a scenario where the worker cannot detect
> the connection loss. For example, you could block network traffic (e.g., with
> iptables) between the primary and the slotsync worker. The key is to create
> a situation where the worker remains stuck waiting for input for a long time.

Here's one way to reproduce the issue using iptables:

----------------------------------------------------
[Set up slot synchronization environment]

initdb -D data --encoding=UTF8 --locale=C
cat <<EOF >> data/postgresql.conf
wal_level = logical
synchronized_standby_slots = 'physical_slot'
EOF
pg_ctl -D data start
pg_receivewal --create-slot -S physical_slot
pg_recvlogical --create-slot -S logical_slot -P pgoutput \
    --enable-failover -d postgres
psql -c "CREATE PUBLICATION mypub"

pg_basebackup -D sby1 -c fast -R -S physical_slot -d "dbname=postgres" \
    -h 127.0.0.1
cat <<EOF >> sby1/postgresql.conf
port = 5433
sync_replication_slots = on
hot_standby_feedback = on
EOF
pg_ctl -D sby1 start

psql -c "SELECT pg_logical_emit_message(true, 'abc', 'xyz')"


[Block network traffic used by slot synchronization]
su -
iptables -A INPUT -p tcp --sport 5432 -j DROP
iptables -A OUTPUT -p tcp --dport 5432 -j DROP


[Promote the standby]
# wait a few seconds
pg_ctl -D sby1 promote
----------------------------------------------------
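
Before blocking the traffic, it can be worth confirming that the failover
slot was actually synchronized to the standby (a quick sanity check I'd
suggest, not part of the minimal repro; port 5433 is the standby's port
from the setup above):

```shell
# On the standby: logical_slot should show synced = t and failover = t
# once the slotsync worker has copied it over from the primary.
psql -p 5433 -c "SELECT slot_name, synced, failover FROM pg_replication_slots WHERE slot_name = 'logical_slot'"
```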

In my tests on master, promotion got stuck in this scenario.
With the patch, promotion completed promptly.
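
One rough way to observe the stuck state (my own check, not strictly needed
for the repro) is to look at the worker's wait event on the standby while
promotion hangs:

```shell
# The slotsync worker has its own backend_type in pg_stat_activity; while
# stuck it should be sitting in a client-read style wait on the primary
# connection rather than making progress.
psql -p 5433 -c "SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE backend_type = 'slotsync worker'"
```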

After testing, you can remove the network block with:

iptables -D INPUT -p tcp --sport 5432 -j DROP
iptables -D OUTPUT -p tcp --dport 5432 -j DROP

Regards,


-- 
Fujii Masao
