Re: Error on vacuum: xmin before relfrozenxid

2018-05-23 Thread Andrey Borodin
Hi!

> On 24 May 2018, at 0:55, Paolo Crosato wrote:
> 
> 1) VACUUM FULL was issued after the first time the error occurred, and a 
> couple of times later. CLUSTER was never run.
> 2) Several failover tests were performed before the cluster was moved to 
> production. However, before the move, the whole cluster was wiped, including 
> all the application and monitoring users. After the db was moved to 
> production, a couple of users were added without any problem.
> 3) No, even if the replication level is set to logical in postgresql.conf, we 
> only use streaming replication.

I've encountered a seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR:  found xmin 747375134 
from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT:  automatic vacuum of 
table "postgres.pg_catalog.pg_database"

The pg_database table probably had not been modified in any way over a long period 
of database operation.
Unfortunately, I found this out only when there were about a million xids left, and I 
had to VACUUM FREEZE the database in single-user mode as soon as possible. But I will 
probably be able to restore the database from backups and inspect it if necessary, 
though the first occurrence of this error was already beyond the recovery window.
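For anyone watching for the same wall, a diagnostic along these lines shows how close 
each database is to the forced single-user-mode cutoff. This is only a sketch: the 
arithmetic assumes the stock ~2^31 usable-xid horizon and the roughly 1M-xid safety 
margin at which the server refuses new transactions.

```sql
-- Sketch: per-database xid age versus the wraparound shutdown margin.
-- age(datfrozenxid) is how many xids old the database's frozen horizon is;
-- the margin figures here are assumptions, not exact server constants.
SELECT datname,
       age(datfrozenxid)                        AS xid_age,
       2147483648 - 1000000 - age(datfrozenxid) AS xids_left_before_shutdown
FROM pg_database
ORDER BY xid_age DESC;
```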

Best regards, Andrey Borodin.


Commit to primary with unavailable sync standby

2019-12-19 Thread Andrey Borodin
Hi!

I cannot figure out a proper way to implement a safe HA upsert. I would be very 
grateful if someone could help me.

Imagine we have a primary server after a failover, and it is network-partitioned. We 
issue an INSERT ... ON CONFLICT DO NOTHING that eventually times out.

az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
INSERT INTO t(
pk,
v,
dt
)
VALUES
(
5,
'text',
now()
)
ON CONFLICT (pk) DO NOTHING
RETURNING pk,
  v,
  dt)
   SELECT new_doc.pk from new_doc;
^CCancel request sent
WARNING:  01000: canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been 
replicated to the standby.
LOCATION:  SyncRepWaitForLSN, syncrep.c:264
Time: 2173.770 ms (00:02.174)

Here our driver decided that something went wrong, so we retry the query.

az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
INSERT INTO t(
pk,
v,
dt
)
VALUES
(
5,
'text',
now()
)
ON CONFLICT (pk) DO NOTHING
RETURNING pk,
  v,
  dt)
   SELECT new_doc.pk from new_doc;
 pk
----
(0 rows)

Time: 4.785 ms
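The empty result is misleading: on the partitioned old primary itself, the row from 
the first attempt is fully committed and visible, which is exactly why ON CONFLICT DO 
NOTHING inserted nothing on retry. A sketch of the check one could run there:

```sql
-- On the old, network-partitioned primary: the first INSERT committed
-- locally despite the cancelled replication wait, so the retried
-- ON CONFLICT DO NOTHING matched this existing row and did nothing.
SELECT pk, v, dt FROM t WHERE pk = 5;
```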

Now we have a split brain, because we acknowledged that row to the client.
How can I fix this?

There must be some obvious trick, but I cannot see it... Or maybe cancelling the wait 
for synchronous replication should be disallowed, and termination should be treated 
as a system failure?

Best regards, Andrey Borodin.



Re: Commit to primary with unavailable sync standby

2019-12-19 Thread Andrey Borodin
Hi Fabio!

Thanks for looking into this.

> On 19 Dec 2019, at 17:14, Fabio Ugo Venchiarutti wrote:
> 
> 
> You're hitting the CAP theorem ( https://en.wikipedia.org/wiki/CAP_theorem )
> 
> 
> You cannot do it with fewer than 3 nodes, as the moment you set your standby 
> to synchronous to achieve consistency, both your nodes become single points 
> of failure.
We have 3 nodes, and the problem is reproducible with all standbys being 
synchronous.

> With 3 or more nodes you can perform what is called a quorum write against 
> ( floor(total_nodes / 2) + 1 ) nodes.
The problem seems to be reproducible in quorum commit too.
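To make "quorum commit" concrete, the setup I mean is along these lines (standby 
names are placeholders for this sketch); the cancel-then-retry hole above is still 
present with it:

```sql
-- Quorum commit: any 2 of the 3 listed standbys must confirm each commit
-- before the client gets an acknowledgement. Node names are illustrative.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (node1, node2, node3)';
SELECT pg_reload_conf();
```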

> With 3+ nodes, the "easy" strategy is to set a quorum number of standby 
> nodes in synchronous_standby_names ( 
> https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-SYNCHRONOUS-STANDBY-NAMES
>  )
> 
> 
> This however makes it tricky to pick the correct standby for promotions 
> during auto-failovers, as you need to freeze all the standbys listed in the 
> above setting in order to correctly determine which one has the highest WAL 
> location without running into race conditions (as the operation is 
> non-atomic, stateful and sticky).
Even after promotion of any standby, we can still commit to the old primary with the 
combination of cancel and retry.

> I personally prefer to designate a fixed synchronous set at setup time and 
> automatically set a static synchronous_standby_names on the master whenever a 
> failover occurs. That allows for a simpler failover mechanism as you know 
> they got the latest WAL location.
No, a synchronous standby does not necessarily have the latest WAL. It is only 
guaranteed to have a WAL position no earlier than that of all commits acknowledged 
to the client.
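That is also why picking a promotion candidate needs care: one has to ask each 
standby for its receive position and compare. A sketch of the per-standby query; 
note the answers are only point-in-time snapshots, which is the race Fabio mentions 
unless replication is frozen during the comparison:

```sql
-- Run on each standby; the node returning the greatest LSN has received
-- the most WAL. Positions can advance between queries, so this comparison
-- is only meaningful while the standbys are not receiving new WAL.
SELECT pg_last_wal_receive_lsn();
```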

Thanks!

Best regards, Andrey Borodin.