Re: Slow planning time for simple query
On a hot standby I have run into a similar problem. The following simple query:

SELECT array_to_json(array_agg(t)) FROM (
    select *
    from main.message m
    join main.message_instance mi on m.message_id = mi.message_id
    join main.channel_type ct on mi.channel_type_id = ct.channel_type_id
    where m.user_id = 2152595
      and ct.name = 'android'
      and m.user_delete_ts is null
    order by case when read_ts is null then 0 else 1 end, m.message_id desc
    limit 100
) t;

is planned in 4.940 ms on the master and in *254.741* ms on the standby. I tried to investigate the reasons for such long planning on the replica and found that index accesses at the planning stage result in multiple heap page reads.

Here are the execution plan statistics for the query above:

Aggregate (actual time=0.641..0.642 rows=1 loops=1)
  Buffers: shared hit=14
  ->  Subquery Scan on t (actual time=0.337..0.360 rows=2 loops=1)
        Buffers: shared hit=14
        ->  Limit (actual time=0.292..0.301 rows=2 loops=1)
              Buffers: shared hit=14
              ->  Sort (actual time=0.287..0.291 rows=2 loops=1)
                    Sort Key: (CASE WHEN (m.read_ts IS NULL) THEN 0 ELSE 1 END), m.message_id DESC
                    Sort Method: quicksort  Memory: 27kB
                    Buffers: shared hit=14
                    ->  Nested Loop (actual time=0.157..0.219 rows=2 loops=1)
                          Buffers: shared hit=14
                          ->  Seq Scan on channel_type ct (actual time=0.043..0.048 rows=1 loops=1)
                                Filter: (name = 'android'::text)
                                Rows Removed by Filter: 7
                                Buffers: shared hit=1
                          ->  Nested Loop (actual time=0.098..0.148 rows=2 loops=1)
                                Buffers: shared hit=13
                                ->  Index Scan using message_user_id_idx1 on message m (actual time=0.055..0.063 rows=2 loops=1)
                                      Index Cond: (user_id = 2152595)
                                      Buffers: shared hit=5
                                ->  Index Scan using message_instance_pkey on message_instance mi (actual time=0.026..0.029 rows=1 loops=2)
                                      Index Cond: ((message_id = m.message_id) AND (channel_type_id = ct.channel_type_id))
                                      Buffers: shared hit=8

Query execution itself touches only 14 buffer cache pages. But the pg_statio_user_tables and pg_statio_user_indexes views show a different picture:

select schemaname, relname, indexrelname, idx_blks_hit, idx_blks_read,
       pg_size_pretty(pg_relation_size(indexrelid::regclass))
from pg_statio_user_indexes
where idx_blks_hit + idx_blks_read > 0;

 schemaname |      relname      |     indexrelname      | idx_blks_hit | idx_blks_read | pg_size_pretty
------------+-------------------+-----------------------+--------------+---------------+----------------
 main       | channel_type      | channel_type_pkey     |            2 |             0 | 16 kB
 main       | message_instance  | message_instance_pkey |          666 |             0 | 345 MB
 main       | message           | message_pkey          |           56 |             0 | 53 MB
 main       | message           | message_user_id_idx1  |            3 |             0 | 17 MB

select schemaname, relname, heap_blks_read, heap_blks_hit
from pg_statio_user_tables
where heap_blks_read + heap_blks_hit + idx_blks_hit + idx_blks_read > 0;

 schemaname |      relname      | heap_blks_read | heap_blks_hit
------------+-------------------+----------------+---------------
 main       | channel_type      |              0 |             3
 main       | message           |              0 |          8682
 main       | message_instance  |              0 |        114922

(This experiment was carried out on a test hot standby without concurrent activity and with cleared system statistics.)

Vacuuming the problem tables (main.message and main.message_instance) on the master node resolves the problem to some extent, but running it frequently slows down all queries and generally increases I/O.

Is there any way to overcome this problem, or is it a fundamental issue that makes it necessary to rewrite the query to simplify planning?

--
Regards,
Maksim Milyutin
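A minimal sketch of how the comparison above can be reproduced (illustrative only, not part of the original report; it assumes sufficient privileges to reset the statistics counters on the test nodes):

-- clear the pg_statio_* counters for the current database before the run
SELECT pg_stat_reset();

-- planning time and buffer usage are reported in the output; run the same
-- statement on the master and on the standby and compare the two
EXPLAIN (ANALYZE, BUFFERS)
select * from main.message m
join main.message_instance mi on m.message_id = mi.message_id
join main.channel_type ct on mi.channel_type_id = ct.channel_type_id
where m.user_id = 2152595
  and ct.name = 'android'
  and m.user_delete_ts is null
order by case when read_ts is null then 0 else 1 end, m.message_id desc
limit 100;

Afterwards pg_statio_user_indexes and pg_statio_user_tables can be queried as shown above to see how many index and heap pages were actually touched, including those read during planning.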
Re: Slow planning time for simple query
On 09.06.2018 21:49, Maksim Milyutin wrote:

> On a hot standby I have run into a similar problem.

Sorry, the problem in question is the one described here:
https://www.postgresql.org/message-id/22136.1528312205%40sss.pgh.pa.us

--
Regards,
Maksim Milyutin
Re: Slow planning time for simple query
On 09.06.2018 22:49, Tom Lane wrote:

> Maksim Milyutin writes:
>> On a hot standby I have run into a similar problem.
>> ... is planned in 4.940 ms on the master and in *254.741* ms on the standby.
>
> (I wonder though why, if you executed the same query on the master, its
> setting of the index-entry-is-dead bits didn't propagate to the standby.)

I have checked the number of dead item pointers (through the pageinspect
extension) in the first leaf page of the index participating in the query
('main.message_instance_pkey') on the master and standby nodes and have
noticed a big difference.

SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1

The vacuum routine improves the situation. Can it be that I have
misconfigured WAL logging or replication somehow?

> I wonder if we should extend the "SnapshotNonVacuumable" logic introduced
> in commit 3ca930fc3 so that in hot standby, *all* index entries are deemed
> non vacuumable. This would essentially get rid of long standby planning
> times in this sort of scenario by instead accepting worse (possibly much
> worse) planner range estimates. I'm unsure if that's a good tradeoff or
> not.

I applied the patch introduced in this commit to the test standby (not the
master; I don't know whether this is correct) and haven't noticed any
difference.

--
Regards,
Maksim Milyutin
Re: First query on each connection is too slow
On 13.06.2018 12:55, Vadim Nevorotin wrote:

> Nested Loop Left Join (cost=0.82..26.41 rows=1 width=4) *(actual time=49.290..49.297 rows=1 loops=1)*
>   Join Filter: ((s_2.connamespace = n.oid) AND (s_2.conrelid = c.oid) AND (a.attnum = ANY (s_2.conkey)))
>   Buffers: shared hit=18
> ...
> * Planning time: 2.589 ms
> Execution time: 49.467 ms*
> (45 rows)
>
> And there are two questions: why this problem occurs and how can I fix it?

You could capture a performance profile of this issue using perf and by
repeatedly executing the query in new sessions, for example via pgbench.
This would greatly facilitate the search for the cause of your problem.

--
Regards,
Maksim Milyutin
Re: Slow planning time for simple query
13.06.2018 12:40, Maksim Milyutin wrote:

> On 09.06.2018 22:49, Tom Lane wrote:
>> Maksim Milyutin writes:
>>> On a hot standby I have run into a similar problem.
>>> ... is planned in 4.940 ms on the master and in *254.741* ms on the standby.
>>
>> (I wonder though why, if you executed the same query on the master, its
>> setting of the index-entry-is-dead bits didn't propagate to the standby.)
>
> I have checked the number of dead item pointers (through the pageinspect
> extension) in the first leaf page of the index participating in the query
> ('main.message_instance_pkey') on the master and standby nodes and have
> noticed a big difference.
>
> SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);
>
> On master:
>
>  blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
> -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
>   3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65
>
> On standby:
>
>  blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
> -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
>   3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1

At this point I want to highlight that the changes to the *lp_flags* bits of
index item pointers (namely, marking items as dead) do not propagate from the
master to the replica in my case. As a consequence, on the standby I have live
index items most of which are marked as dead on the master. At the planning
stage my queries are then forced to descend to heap pages during
*get_actual_variable_range* execution, which considerably slows down planning.

Is this a bug, a restriction of the implementation, or a misconfiguration of
WAL/replication?

--
Regards,
Maksim Milyutin
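For what it is worth, the difference is not limited to a single leaf page. A rough way to compare the overall amount of dead index items on both nodes (a sketch only; it assumes pageinspect is installed in the monitoring schema as above and that the chosen block range exists in the index) is to aggregate bt_page_stats() over a range of blocks:

-- sum live and dead item counts over the first 100 non-meta pages of the
-- index; run on the master and on the standby and compare the "dead" column
SELECT sum(s.live_items) AS live, sum(s.dead_items) AS dead
FROM generate_series(1, 100) AS blk,
     LATERAL monitoring.bt_page_stats('main.message_instance_pkey', blk) AS s
WHERE s.type = 'l';  -- leaf pages only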
Re: "checkpointer process" is consuming more memory. How to control it?
22.08.2018 16:43, Raghavendra Rao J S V wrote:

> We have a database cluster "db1_data". Under this cluster we have two
> databases: one is *db1* and the other is *qovr*. I was surprised to see
> that the "checkpointer process" is consuming 8.73 GB of memory (RSS value
> of 9158892). Why is the "checkpointer process" consuming this much memory,
> and how can we limit its memory usage?

The RSS value is not a reasonable way to detect memory leaks because it
includes shared segments (e.g. from the shared buffer cache). As a
long-lived process, the checkpointer flushes, and as a consequence touches,
every buffer cell, so its RSS approaches its locally allocated memory plus
shared_buffers. If you want to know the real local memory consumption, you
can use the Python utility *smem* to see the unshared (local) memory size.

--
Regards,
Maksim Milyutin
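As a quick illustration of the point above (a sketch, not from the original mail): compare the configured size of the shared buffer pool with the observed RSS. A checkpointer RSS that approaches shared_buffers plus a comparatively small private allocation is expected behaviour rather than a leak.

-- show how much shared memory the buffer cache alone accounts for; with a
-- large shared_buffers setting, a checkpointer RSS slightly above this value
-- is unsurprising once the process has touched most of the buffer pool
SHOW shared_buffers;
SELECT name, setting, unit FROM pg_settings WHERE name = 'shared_buffers';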
Re: Commit to primary with unavailable sync standby
On 19.12.2019 14:04, Andrey Borodin wrote:

> Hi!

Hi!

FYI, this topic came up recently in -hackers:
https://www.postgresql.org/message-id/caeet0zhg5off7iecby6tzadh1moslmfz1hlm311p9vot7z+...@mail.gmail.com

> I cannot figure out a proper way to implement safe HA upsert. I will be
> very grateful if someone would help me.
>
> Imagine we have a primary server after failover. It is network-partitioned.
> We are doing INSERT ON CONFLICT DO NOTHING; that eventually timed out.
>
> az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
>     INSERT INTO t( pk, v, dt )
>     VALUES ( 5, 'text', now() )
>     ON CONFLICT (pk) DO NOTHING
>     RETURNING pk, v, dt)
> SELECT new_doc.pk from new_doc;
> ^CCancel request sent
> WARNING:  01000: canceling wait for synchronous replication due to user request
> DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
> LOCATION:  SyncRepWaitForLSN, syncrep.c:264
> Time: 2173.770 ms (00:02.174)
>
> Here our driver decided that something went wrong and we retry the query.
>
> az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
>     INSERT INTO t( pk, v, dt )
>     VALUES ( 5, 'text', now() )
>     ON CONFLICT (pk) DO NOTHING
>     RETURNING pk, v, dt)
> SELECT new_doc.pk from new_doc;
>  pk
> ----
> (0 rows)
>
> Time: 4.785 ms
>
> Now we have split-brain, because we acknowledged that row to the client.
>
> How can I fix this? There must be some obvious trick, but I cannot see it...
> Or maybe cancel of sync replication should be disallowed and termination
> should be treated as a system failure?

I think the most appropriate way to handle such issues is for the client
driver to catch such warnings (with the message about the local commit) and
mark the status of the posted transaction as undetermined. If the connection
with the sync replica comes back, this transaction eventually commits; but if
autofailover is triggered and *this commit is not replicated to the replica*,
the commit is effectively aborted. Therefore the client has to wait some time
(exceeding the duration of an autofailover) and then check (logically, based
on the committed data) the status of the commit. The problem here is that the
locally committed data becomes visible to future transactions (before the
autofailover), which violates the property of consistent reads from the
master. IMO the more correct behavior for PostgreSQL here would be to ignore
any cancel / termination requests while the backend is waiting for a response
from the sync replicas.

However, there is another way to end up with locally applied commits: a
restart of the master followed by crash recovery. This case is described in
the documentation:
https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA
Here an HA orchestrator agent can close access for external users (via
pg_hba.conf manipulation) until the PostgreSQL instance has synchronized its
changes with all sync replicas, as is implemented in Stolon:
https://github.com/sorintlab/stolon/blob/master/doc/syncrepl.md#handling-postgresql-sync-repl-limits-under-such-circumstances

Best regards,
Maksim Milyutin
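One possible shape of the "check the status of the commit afterwards" step (a sketch of my own, not something proposed in the thread; it assumes PostgreSQL 10+, where txid_status() is available, and that the transaction id is still within the range retained by the server): the client remembers the transaction id before committing and asks the surviving primary about it once the failover window has passed.

BEGIN;
SELECT txid_current() AS my_xid;   -- save this value on the client side
INSERT INTO t (pk, v, dt)
VALUES (5, 'text', now())
ON CONFLICT (pk) DO NOTHING;
COMMIT;  -- may block waiting for the sync standby and be cancelled with the
         -- "committed locally" warning shown above

-- later, from a new session, substituting the xid saved earlier
-- (1234567 is just a placeholder):
SELECT txid_status(1234567);  -- 'committed', 'aborted', 'in progress' or NULL

Whether this check is conclusive after a failover of course depends on which node answers it and on how much of the old primary's WAL that node received.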
Re: Commit to primary with unavailable sync standby
On 19.12.2019 18:08, Fabio Ugo Venchiarutti wrote:

> On 19/12/2019 13:58, Maksim Milyutin wrote:
>> On 19.12.2019 14:04, Andrey Borodin wrote:
>>> Hi!
>>
>> Hi!
>>
>> FYI, this topic came up recently in -hackers:
>> https://www.postgresql.org/message-id/caeet0zhg5off7iecby6tzadh1moslmfz1hlm311p9vot7z+...@mail.gmail.com
>>
>>> I cannot figure out a proper way to implement safe HA upsert. I will be
>>> very grateful if someone would help me.
>>>
>>> Imagine we have a primary server after failover. It is
>>> network-partitioned. We are doing INSERT ON CONFLICT DO NOTHING; that
>>> eventually timed out.
>>>
>>> az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
>>>     INSERT INTO t( pk, v, dt )
>>>     VALUES ( 5, 'text', now() )
>>>     ON CONFLICT (pk) DO NOTHING
>>>     RETURNING pk, v, dt)
>>> SELECT new_doc.pk from new_doc;
>>> ^CCancel request sent
>>> WARNING:  01000: canceling wait for synchronous replication due to user request
>>> DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
>>> LOCATION:  SyncRepWaitForLSN, syncrep.c:264
>>> Time: 2173.770 ms (00:02.174)
>>>
>>> Here our driver decided that something went wrong and we retry the query.
>>>
>>> az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
>>>     INSERT INTO t( pk, v, dt )
>>>     VALUES ( 5, 'text', now() )
>>>     ON CONFLICT (pk) DO NOTHING
>>>     RETURNING pk, v, dt)
>>> SELECT new_doc.pk from new_doc;
>>>  pk
>>> ----
>>> (0 rows)
>>>
>>> Time: 4.785 ms
>>>
>>> Now we have split-brain, because we acknowledged that row to the client.
>>>
>>> How can I fix this? There must be some obvious trick, but I cannot see
>>> it... Or maybe cancel of sync replication should be disallowed and
>>> termination should be treated as a system failure?
>>
>> I think the most appropriate way to handle such issues is for the client
>> driver to catch such warnings (with the message about the local commit)
>> and mark the status of the posted transaction as undetermined. If the
>> connection with the sync replica comes back, this transaction eventually
>> commits; but if autofailover is triggered and *this commit is not
>> replicated to the replica*, the commit is effectively aborted. Therefore
>> the client has to wait some time (exceeding the duration of an
>> autofailover) and then check (logically, based on the committed data) the
>> status of the commit. The problem here is that the locally committed data
>> becomes visible to future transactions (before the autofailover), which
>> violates the property of consistent reads from the master. IMO the more
>> correct behavior for PostgreSQL here would be to ignore any cancel /
>> termination requests while the backend is waiting for a response from the
>> sync replicas.
>>
>> However, there is another way to end up with locally applied commits: a
>> restart of the master followed by crash recovery. This case is described
>> in the documentation:
>> https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA
>> Here an HA orchestrator agent can close access for external users (via
>> pg_hba.conf manipulation) until the PostgreSQL instance synchronizes
>
> And this is where the unsafety lies: that assumes that the isolated master
> is in enough of a sane state to apply a self-ban (and that it can do so in
> near-zero time).
>
> Although the retry logic in Andrey's case is probably not ideal (and you
> offered a more correct approach to synchronous commit), there are many
> "grey area" failure modes that in his scenario would either prevent a given
> node from sealing up fast enough, if at all (eg: PID congestion causing
> fork()/system() to fail while backends are already up and happily flushing
> WAL).
>
> This is particularly relevant to situations when only a subset of critical
> transactions set synchronous_commit to remote_*: it'd still be undesirable
> to sink "tier 2" data in a stale primary for any significant length of
> time.

Could you describe your thesis more concretely? In my proposal the self-ban
on the master is applied after restarting it, so that changes from locally
committed transactions are not visible to new incoming transactions.

> In the case of postgres (or any RDBMS, really), all I can think of is
> either an inline proxy performing some validation as part of the forwarding
> (which is what we did internally but that has not been green lit for FOSS
> :( )

External validation unfortunately is not an option here. As I mentioned
before, the local commits become visible to future transactions coming to
the master, and even if some proxy reports to the client that the
transaction has not been committed completely, new incoming transactions
that read the locally applied changes and make their own changes based on
them implicitly confirm the status of those changes as committed.

> or some logic in the backend that rejects asynchronous commits too if some
> condition is not met (eg: synchronous standby nodes not present - a builtin
> version of the pg_stat_replication look-aside CTE I suggested earl
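A rough sketch of how such a "look-aside CTE" gate might look (this is my reading of the idea, not Fabio's actual proposal; it reuses table t and the values from the example above, assumes the role may read pg_stat_replication, and is inherently racy because a standby can drop out right after the check is evaluated):

-- perform the write only if at least one standby is currently listed as
-- synchronous; otherwise insert nothing and let the client retry elsewhere
WITH sync_ok AS (
    SELECT count(*) AS n
    FROM pg_stat_replication
    WHERE sync_state IN ('sync', 'quorum')
)
INSERT INTO t (pk, v, dt)
SELECT 5, 'text', now()
FROM sync_ok
WHERE n > 0
ON CONFLICT (pk) DO NOTHING
RETURNING pk;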
Re: High Availability, guarantee to use sync nodes
On 31.08.2020 14:06, Dirk Krautschick wrote:

> Hi all,

Hi!

> if there are several PostgreSQL nodes replicating synchronously and I
> configure a guarantee that, for example, 3 of 5 nodes have to be in sync
> to go on with synchronous_standby_names, and now I want to implement a
> load balancing scenario with active standby DBs… how or with what tool can
> I achieve that my read clients only use sync nodes?
>
> For sure somehow a check in pg_stat_replication beforehand is possible,
> but is there something in parameters or with solutions like repmgr,
> patroni, etc. to do this?

For tools such as Patroni or Stolon the list of sync replicas can be
obtained from the DCS storage (etcd, consul, etc). Your application or some
proxy, e.g. haproxy with confd, could extract the list of sync nodes,
subscribe to changes in the KV entry that stores this list, and route
queries accordingly.

--
Regards,
Maksim Milyutin
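A minimal sketch of the pg_stat_replication check mentioned in the question (illustrative only; it has to be run on the primary, and a standby listed here can still drop out right after the check):

-- which standbys are currently counted as synchronous, and what the
-- configured policy is; a proxy template (e.g. confd/haproxy) could refresh
-- its backend list from this periodically
SELECT application_name, client_addr, state, sync_priority, sync_state
FROM pg_stat_replication
WHERE sync_state IN ('sync', 'quorum')
ORDER BY sync_priority;

SELECT current_setting('synchronous_standby_names') AS sync_policy;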