Hi All,

Thank you for reporting a real gap and building this feature to address it.
Very nice points were discussed in this thread.

I reviewed the v20260318 patch and some comments.

Documentation comments:
1. FIRST mode does not specify what happens when valid slots < N
"If a slot is missing, logical, invalidated, or inactive, it will be
skipped. However, if a slot exists and is valid and active but has not yet
caught up, the system will wait for it rather than skipping to
lower-priority slots."
This paragraph explains the skip/wait distinction clearly, but doesn't
clearly address what happens when, after skipping all
missing/invalid/inactive/logical slots, the number of remaining valid slots
is less than num_sync?

For example, with FIRST 2 (sb1_slot, sb2_slot, sb3_slot): if sb1_slot and
sb2_slot are both invalidated and only sb3_slot is valid but lagging FIRST
2 requires two slots, but only one candidate remains.

Looking at the code in StandbySlotsHaveCaughtup(), when syncrep_method ==
SYNC_REP_PRIORITY and a slot is lagging, the code does:
if (wait_for_all || synchronized_standby_slots_config->syncrep_method ==
SYNC_REP_PRIORITY)
    break;

So the function breaks out of the loop and returns false. This is the
correct behavior, but it is not stated anywhere in the documentation. A
user encountering this scenario will not know whether to expect a wait or
an error. The documentation should state explicitly that in FIRST mode, if
fewer valid slots than num_sync are available, logical decoding waits
indefinitely.

2. "Missing, logical, invalidated, or inactive slots are skipped when
determining candidates, and lagging slots simply do not count toward the
required number until they catch up"
This is correct for the case where some slots are skipped and others have
caught up. But it does not address the case where all listed slots are
lagging and every slot is healthy and connected, but none have reached
wait_for_lsn yet. In that situation, the code records each slot as
SS_SLOT_LAGGING, does goto next_slot for each (because syncrep_method ==
SYNC_REP_QUORUM), and returns false because caught_up_slot_num < required.
Logical decoding waits.

You can append the following sentence to the above documentation paragraph
"If fewer than num_sync slots have caught up at a given moment, logical
decoding waits until that threshold is reached."


Test comments:
1. "PART D: Verify FIRST N priority semantics. # 3. Wait for valid but
lagging slots (not skip to lower priority)"
The test implements this by calling $standby1->stop. A stopped standby has
no active_pid, so the slot is classified as SS_SLOT_INACTIVE, not
SS_SLOT_LAGGING.
SS_SLOT_LAGGING means it is connected and streaming but restart_lsn <
wait_for_lsn.

As Hou previously mentioned, recovery_min_apply_delay on standby1 would be
one way to keep it connected while forcing its WAL application to lag,
exercising the SS_SLOT_LAGGING code path directly. It is worth adding a
test that covers this, both for FIRST (to confirm it blocks) and for ANY
(to confirm it does not).

Overall a great patch.

Regards,
Surya Poondla

Reply via email to