Reviving this thread.

On Sun, Jan 29, 2023 at 9:55 PM Bharath Rupireddy wrote:

> For proc die, it looks like the suggestion was to process it
> immediately and upon next restart, don't allow user connections unless
> all sync standbys were caught up. However, we need to be able to allow
> replication connections from standbys so that they'll be able to
> stream the needed WAL and catch up with primary, allow superuser or
> users with pg_monitor role to connect to perform ALTER SYSTEM to
> remove the unresponsive sync standbys if any from the list or disable
> sync replication altogether or monitor for flush lsn/catch up status.
> And block all other connections. Note that replication, superuser and
> users with pg_monitor role connections are allowed only after the
> server reaches a consistent state not before that to not read any
> inconsistent data.
>

Allowing replication, superuser, and pg_monitor connections seems
reasonable to me.


>
> The trickiest part of doing the above is how we detect upon restart
> that the server received proc die while waiting for sync replication
> ACK. One idea might be to set a flag in the control file before the
> crash. Second idea might be to write a marker file (although I don't
> favor this idea); presence indicates that the server was waiting for
> sync replication ACK before the crash. However, we may not detect all
> sorts of crashes in a backend when it is waiting for sync replication
> ACK to do any of these two ideas. Therefore, this may not be a
> complete solution.
>

We cannot control how the crash happens. It could be a simple power
failure, in which case neither the control file flag nor the marker file
may have reached disk.
Additionally, either write would sit in the critical transaction commit
path.


>
> Third idea might be to just let the primary wait for sync standbys to
> catch up upon restart irrespective of whether it was crashed or not
> while waiting for sync replication ACK. While this idea works well
> without having to detect all sorts of crashes, the primary may not
> come up if any unresponsive standbys are present (currently, the
> primary continues to be operational for read-only queries at least
> irrespective of whether sync standbys have caught up or not).
>

I prefer this approach because, depending on the quorum policy defined in
synchronous_standby_names, the primary will open connections for
reads/writes once enough sync standbys have caught up.
If the sync standbys make no progress, the Postgres admin has to step in
regardless.
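
To make the quorum point concrete, here is a rough sketch of the check I
have in mind. It is illustrative Python, not a proposal for the actual
server code. The names Standby, flush_lsn, target_lsn, and
quorum_caught_up are made up for this example, and it only models the
quorum-style "ANY num_sync (...)" form of synchronous_standby_names, not
priority-based FIRST.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Standby:
    """Hypothetical view of one sync standby as seen by the primary."""
    name: str
    flush_lsn: int  # last WAL position the standby reports as flushed


def quorum_caught_up(standbys: List[Standby], num_sync: int,
                     target_lsn: int) -> bool:
    """Return True once at least num_sync standbys have flushed WAL up to
    target_lsn (assumed here to be the primary's end-of-WAL at restart)."""
    caught_up = sum(1 for s in standbys if s.flush_lsn >= target_lsn)
    return caught_up >= num_sync
```

With three standbys listed and num_sync = 2, the primary in this sketch
would open for regular connections as soon as any two of them report a
flush position at or past target_lsn, even if the third stays
unresponsive.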

Thanks,
Satya
