Fix slotsync worker blocking promotion when stuck in wait

Previously, on standby promotion, the startup process sent SIGUSR1 to
the slotsync worker (or a backend performing slot synchronization) and
waited for it to exit. This worked in most cases, but if the process was
blocked waiting for a response from the primary (e.g., due to a network
failure), SIGUSR1 would not interrupt the wait. As a result, the process
could remain stuck, causing the startup process to wait for a long time
and delaying promotion.

This commit fixes the issue by introducing a new procsignal reason,
PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process sends this
signal, and the handler sets interrupt flags so the process exits (or
errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing promotion to
complete without delay.

Backpatch to v17, where slotsync was introduced.

Author: Nisha Moond <[email protected]>
Reviewed-by: shveta malik <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Reviewed-by: Zhijie Hou <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4ej6ryvhvmbz91exj_tp...@mail.gmail.com
Backpatch-through: 17

Branch
------
REL_17_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/15910b1c363f47b3984d24a91ed75ddac36070d8

Modified Files
--------------
src/backend/replication/logical/slotsync.c | 138 ++++++++++++++++++++---------
src/backend/storage/ipc/procsignal.c       |   4 +
src/backend/tcop/postgres.c                |   4 +
src/include/replication/slotsync.h         |   7 ++
src/include/storage/procsignal.h           |   1 +
5 files changed, 111 insertions(+), 43 deletions(-)
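
Illustration
------------
The general pattern described in the commit message (a signal handler
that only sets flags, and a wait loop that acts on them at a well-defined
check point) can be sketched roughly as follows. This is a standalone
approximation in plain POSIX C, not the code from this commit: every
demo_* name and both flag variables are hypothetical, and raise() on the
current process stands in for the startup process signaling the worker.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flags; stand-ins, not the PostgreSQL variables. */
static volatile sig_atomic_t demo_interrupt_pending = 0;
static volatile sig_atomic_t demo_stop_sync_pending = 0;

/* The handler only sets flags; the real work happens at the check point. */
static void demo_signal_handler(int signo)
{
    (void) signo;
    demo_stop_sync_pending = 1;
    demo_interrupt_pending = 1;
}

static void demo_install_handler(void)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_signal_handler;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void) rc;
}

/* Rough analogue of an interrupt check: true means "stop now". */
static bool demo_check_for_interrupts(void)
{
    if (!demo_interrupt_pending)
        return false;
    demo_interrupt_pending = 0;
    if (demo_stop_sync_pending)
    {
        demo_stop_sync_pending = 0;
        return true;
    }
    return false;
}

/*
 * Simulated wait for a reply that never arrives. At iteration signal_at
 * the process signals itself, standing in for the promotion request.
 * Returns the iteration at which the wait was abandoned, or -1 if it
 * ran to max_polls without being interrupted.
 */
static int demo_wait_for_primary(int max_polls, int signal_at)
{
    for (int i = 0; i < max_polls; i++)
    {
        if (i == signal_at)
            raise(SIGUSR1);
        if (demo_check_for_interrupts())
            return i;
    }
    return -1;
}
```

In this sketch, calling demo_install_handler() and then
demo_wait_for_primary(10, 3) abandons the wait at the iteration where the
signal arrives, while a run with no signal continues until max_polls is
exhausted. The point of keeping the handler to flag-setting is that the
exit or error path then runs in normal (non-signal) context.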
