On Mon, Mar 23, 2026 at 9:51 AM surya poondla <[email protected]> wrote:
>
> Hi All,
>
> Thank you for reporting a real gap and building this feature to address it. 
> Very nice points were discussed in this thread.
>
> I reviewed the v20260318 patch and some comments.
>
> Documentation comments:
> 1. FIRST mode does not specify what happens when valid slots < N
> "If a slot is missing, logical, invalidated, or inactive, it will be skipped. 
> However, if a slot exists and is valid and active but has not yet caught up, 
> the system will wait for it rather than skipping to lower-priority slots."
> This paragraph explains the skip/wait distinction clearly, but doesn't 
> clearly address what happens when, after skipping all 
> missing/invalid/inactive/logical slots, the number of remaining valid slots 
> is less than num_sync?
>
> For example, with FIRST 2 (sb1_slot, sb2_slot, sb3_slot): if sb1_slot and 
> sb2_slot are both invalidated and only sb3_slot is valid but lagging FIRST 2 
> requires two slots, but only one candidate remains.
>
> Looking at the code in StandbySlotsHaveCaughtup(), when syncrep_method == 
> SYNC_REP_PRIORITY and a slot is lagging, the code does:
> if (wait_for_all || synchronized_standby_slots_config->syncrep_method == 
> SYNC_REP_PRIORITY)
>     break;
>
> So the function breaks out of the loop and returns false. This is the correct 
> behavior, but it is not stated anywhere in the documentation. A user 
> encountering this scenario will not know whether to expect a wait or an 
> error. The documentation should state explicitly that in FIRST mode, if fewer 
> valid slots than num_sync are available, logical decoding waits indefinitely.
>
> 2. "Missing, logical, invalidated, or inactive slots are skipped when 
> determining candidates, and lagging slots simply do not count toward the 
> required number until they catch up"
> This is correct for the case where some slots are skipped and others have 
> caught up. But it does not address the case where all listed slots are 
> lagging and every slot is healthy and connected, but none have reached 
> wait_for_lsn yet. In that situation, the code records each slot as 
> SS_SLOT_LAGGING, does goto next_slot for each (because syncrep_method == 
> SYNC_REP_QUORUM), and returns false because caught_up_slot_num < required. 
> Logical decoding waits.
>
> You can append the following sentence to the above documentation paragraph 
> "If fewer than num_sync slots have caught up at a given moment, logical 
> decoding waits until that threshold is reached."
>
>
> Test comments:
> 1. "PART D: Verify FIRST N priority semantics. # 3. Wait for valid but 
> lagging slots (not skip to lower priority)"
> The test implements this by calling $standby1->stop. A stopped standby has no 
> active_pid, so the slot is classified as SS_SLOT_INACTIVE, not 
> SS_SLOT_LAGGING.
> SS_SLOT_LAGGING means it is connected and streaming but restart_lsn < 
> wait_for_lsn.
>
> As Hou previously mentioned, recovery_min_apply_delay on standby1 would be 
> one way to keep it connected while forcing its WAL application to lag, 
> exercising the SS_SLOT_LAGGING code path directly. It is worth adding a test 
> that covers this, both for FIRST (to confirm it blocks) and for ANY (to 
> confirm it does not).
>
> Overall a great patch.
>

Thank you - Hou, Surya and Dilip for your feedback. I will address all
your comments in the next patch.

--
With Regards,
Ashutosh Sharma.


Reply via email to