Re: Slow planning time for simple query

2018-06-09 Thread Maksim Milyutin

On a hot standby I faced a similar problem.


The following simple query:

SELECT array_to_json(array_agg(t)) from (
    select *
    from main.message m
    join main.message_instance mi on m.message_id = mi.message_id
    join main.channel_type ct on mi.channel_type_id = 
ct.channel_type_id

    where   m.user_id = 2152595
    and ct.name = 'android'
    and m.user_delete_ts is null
    order by
    case
    when read_ts is null then 0
    else 1
    end,
    m.message_id desc
    limit 100
    ) t;

is planned in 4.940 ms on the master and *254.741* ms on the standby. I tried to 
investigate the reasons for such long planning on the replica and determined 
that index accesses at the planning stage result in multiple heap scans.



Here is the execution plan statistics of query above:

 Aggregate (actual time=0.641..0.642 rows=1 loops=1)
   Buffers: shared hit=14
   ->  Subquery Scan on t (actual time=0.337..0.360 rows=2 loops=1)
         Buffers: shared hit=14
         ->  Limit (actual time=0.292..0.301 rows=2 loops=1)
               Buffers: shared hit=14
               ->  Sort (actual time=0.287..0.291 rows=2 loops=1)
                     Sort Key: (CASE WHEN (m.read_ts IS NULL) THEN 0 ELSE 1 END), m.message_id DESC
                     Sort Method: quicksort  Memory: 27kB
                     Buffers: shared hit=14
                     ->  Nested Loop (actual time=0.157..0.219 rows=2 loops=1)
                           Buffers: shared hit=14
                           ->  Seq Scan on channel_type ct (actual time=0.043..0.048 rows=1 loops=1)
                                 Filter: (name = 'android'::text)
                                 Rows Removed by Filter: 7
                                 Buffers: shared hit=1
                           ->  Nested Loop (actual time=0.098..0.148 rows=2 loops=1)
                                 Buffers: shared hit=13
                                 ->  Index Scan using message_user_id_idx1 on message m (actual time=0.055..0.063 rows=2 loops=1)
                                       Index Cond: (user_id = 2152595)
                                       Buffers: shared hit=5
                                 ->  Index Scan using message_instance_pkey on message_instance mi (actual time=0.026..0.029 rows=1 loops=2)
                                       Index Cond: ((message_id = m.message_id) AND (channel_type_id = ct.channel_type_id))
                                       Buffers: shared hit=8

So query execution touches only 14 buffers. But the pg_statio_user_tables and 
pg_statio_user_indexes views show a different picture:


select
    schemaname, relname, indexrelname, idx_blks_hit, idx_blks_read,
pg_size_pretty(pg_relation_size(indexrelid::regclass))
from pg_statio_user_indexes
where idx_blks_hit + idx_blks_read > 0;

 schemaname |     relname      |     indexrelname      | idx_blks_hit | idx_blks_read | pg_size_pretty
------------+------------------+-----------------------+--------------+---------------+----------------
 main       | channel_type     | channel_type_pkey     |            2 |             0 | 16 kB
 main       | message_instance | message_instance_pkey |          666 |             0 | 345 MB
 main       | message          | message_pkey          |           56 |             0 | 53 MB
 main       | message          | message_user_id_idx1  |            3 |             0 | 17 MB


select
    schemaname, relname, heap_blks_read, heap_blks_hit
from pg_statio_user_tables
where heap_blks_read + heap_blks_hit + idx_blks_hit + idx_blks_read > 0;

 schemaname |     relname      | heap_blks_read | heap_blks_hit
------------+------------------+----------------+---------------
 main       | channel_type     |              0 |             3
 main       | message          |              0 |          8682
 main       | message_instance |              0 |        114922

(This experiment was carried out on a test hot standby without concurrent 
activity and with cleared system statistics.)
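
For reference, the counters can be cleared before such an experiment with something like the following (a sketch; pg_stat_reset() zeroes the per-database counters behind the pg_statio_* views):

-- clear the per-database statistics counters so that the pg_statio_* numbers
-- above reflect only the single query execution being measured
SELECT pg_stat_reset();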



Running VACUUM on the problem tables (main.message and main.message_instance) on 
the master node resolves the problem for a while, but running it frequently slows 
down all queries and generally increases IO.
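
For completeness, the workaround boils down to something like this (a sketch; run on the master during a quiet window so that the cleanup is replayed on the standby):

-- vacuum only the two tables whose indexes are probed during planning,
-- instead of vacuuming the whole database
VACUUM (VERBOSE) main.message;
VACUUM (VERBOSE) main.message_instance;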



Is there any way to overcome this problem, or is it a fundamental issue that makes 
it necessary to rewrite the query to simplify planning?



--
Regards,
Maksim Milyutin




Re: Slow planning time for simple query

2018-06-09 Thread Maksim Milyutin

On 09.06.2018 21:49, Maksim Milyutin wrote:


On a hot standby I faced a similar problem.


Sorry, the problem in question is described here 
https://www.postgresql.org/message-id/22136.1528312205%40sss.pgh.pa.us


--
Regards,
Maksim Milyutin




Re: Slow planning time for simple query

2018-06-13 Thread Maksim Milyutin

On 09.06.2018 22:49, Tom Lane wrote:


Maksim Milyutin  writes:

On a hot standby I faced a similar problem.
...
is planned in 4.940 ms on the master and *254.741* ms on the standby.


(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)


I have checked the number of dead item pointers (through the pageinspect 
extension) in the first leaf page of the index participating in the query 
('main.message_instance_pkey') on the master and standby nodes and noticed a 
big difference.


SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65


On standby:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1



Running VACUUM improves the situation.
Could it be that I have incorrectly configured WAL logging or replication?
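
For reference, here is a sketch of the settings that could be compared between the master and the standby (none of them is claimed to explain the difference by itself):

-- compare on both nodes; wal_log_hints (and data checksums) affect whether
-- hint-bit-only page changes can ever reach the WAL as full-page images
SELECT name, setting
FROM pg_settings
WHERE name IN ('wal_level', 'hot_standby', 'hot_standby_feedback', 'wal_log_hints');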



I wonder if we should extend the "SnapshotNonVacuumable" logic introduced
in commit 3ca930fc3 so that in hot standby, *all* index entries are deemed
non vacuumable.  This would essentially get rid of long standby planning
times in this sort of scenario by instead accepting worse (possibly much
worse) planner range estimates.  I'm unsure if that's a good tradeoff or
not.


I applied the patch introduced in this commit to the test standby (not the 
master; I don't know whether this is correct) and haven't noticed any 
difference.


--
Regards,
Maksim Milyutin



Re: First query on each connection is too slow

2018-06-13 Thread Maksim Milyutin

On 13.06.2018 12:55, Vadim Nevorotin wrote:

Nested Loop Left Join  (cost=0.82..26.41 rows=1 width=4) *(actual time=49.290..49.297 rows=1 loops=1)*
   Join Filter: ((s_2.connamespace = n.oid) AND (s_2.conrelid = c.oid) AND (a.attnum = ANY (s_2.conkey)))
   Buffers: shared hit=18
...
* Planning time: 2.589 ms
 Execution time: 49.467 ms*

(45 rows)

And there are two questions: why does this problem occur, and how can I fix it?


You could capture a performance profile of this issue by using perf and 
iteratively executing this query in new sessions via pgbench, for example. 
This would greatly facilitate finding the cause of your problem.


--
Regards,
Maksim Milyutin



Re: Slow planning time for simple query

2018-06-13 Thread Maksim Milyutin

13.06.2018 12:40, Maksim Milyutin wrote:


On 09.06.2018 22:49, Tom Lane wrote:


Maksim Milyutin  writes:

On a hot standby I faced a similar problem.
...
is planned in 4.940 ms on the master and *254.741* ms on the standby.

(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)


I have checked the number of dead item pointers (through the pageinspect 
extension) in the first leaf page of the index participating in the query 
('main.message_instance_pkey') on the master and standby nodes and noticed a 
big difference.


SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);


On master:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65


On standby:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1





At this point I want to highlight that the changes in *lp_flags* bits (namely, 
marking items as dead) for index item pointers don't propagate from the master 
to the replica in my case. As a consequence, on the standby I have live index 
items most of which are marked dead on the master. And at the planning stage my 
queries are forced to descend to heap pages during *get_actual_variable_range* 
execution, which considerably slows down planning.
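
(Side note: on newer PostgreSQL releases, 13 and later, pageinspect's bt_page_items() also exposes a dead column, so the individual LP_DEAD flags could be compared directly between the nodes; a sketch, assuming the same index and block as above:)

-- requires pageinspect on PostgreSQL 13+; on the versions discussed in this
-- thread only bt_page_stats() reports the aggregated dead_items count
SELECT itemoffset, ctid, dead
FROM monitoring.bt_page_items('main.message_instance_pkey', 3705)
WHERE dead;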


Is this a bug, a restriction of the implementation, or a misconfiguration of 
WAL/replication?


--
Regards,
Maksim Milyutin



Re: "checkpointer process" is consuming more memory. How to control it?

2018-08-22 Thread Maksim Milyutin

22.08.2018 16:43, Raghavendra Rao J S V wrote:

We have a database cluster "db1_data". Under this cluster we have two 
databases: one is *db1* and the other is *qovr*. I was surprised to see the 
"checkpointer process" consuming 8.73 GB of memory (an RSS value of 9158892). 
Why is the "checkpointer process" consuming this much memory, and how can I 
limit its memory usage?


The RSS value is not a reasonable way to detect memory leaks because it 
includes shared segments (e.g. from the shared buffer cache). As a long-lived 
process, the checkpointer tries to flush, and as a consequence touch, every 
buffer cell, so its RSS approaches its locally allocated memory plus 
shared_buffers.
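
As a quick sanity check, that RSS figure can be compared against the size of the shared buffer cache itself (a sketch):

-- shared_buffers is reported in blocks of the given unit; multiply them out
-- and compare with the checkpointer's RSS
SELECT name, setting, unit
FROM pg_settings
WHERE name = 'shared_buffers';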


If you want to know the real local memory consumption, you can use the python 
utility *smem* to see the unshared local memory size.


--
Regards,
Maksim Milyutin



Re: Commit to primary with unavailable sync standby

2019-12-19 Thread Maksim Milyutin

On 19.12.2019 14:04, Andrey Borodin wrote:


Hi!



Hi!

FYI, this topic was up recently in -hackers 
https://www.postgresql.org/message-id/caeet0zhg5off7iecby6tzadh1moslmfz1hlm311p9vot7z+...@mail.gmail.com




I cannot figure out a proper way to implement safe HA upsert. I will be very 
grateful if someone could help me.

Imagine we have a primary server after failover. It is network-partitioned. We 
are doing an INSERT ON CONFLICT DO NOTHING that eventually timed out.

az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
 INSERT INTO t(
 pk,
 v,
 dt
 )
 VALUES
 (
 5,
 'text',
 now()
 )
 ON CONFLICT (pk) DO NOTHING
 RETURNING pk,
   v,
   dt)
SELECT new_doc.pk from new_doc;
^CCancel request sent
WARNING:  01000: canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been 
replicated to the standby.
LOCATION:  SyncRepWaitForLSN, syncrep.c:264
Time: 2173.770 ms (00:02.174)

Here our driver decided that something went wrong, and we retried the query.

az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
 INSERT INTO t(
 pk,
 v,
 dt
 )
 VALUES
 (
 5,
 'text',
 now()
 )
 ON CONFLICT (pk) DO NOTHING
 RETURNING pk,
   v,
   dt)
SELECT new_doc.pk from new_doc;
  pk

(0 rows)

Time: 4.785 ms

Now we have split-brain, because we acknowledged that row to the client.
How can I fix this?

There must be some obvious trick, but I cannot see it... Or maybe cancelling the 
wait for sync replication should be disallowed and termination should be treated 
as a system failure?



I think the most appropriate way to handle such issues is for the client driver 
to catch such warnings (with the message about the local commit) and mark the 
status of the posted transaction as undetermined. If the connection with the 
sync replica comes back, this transaction eventually commits; but after 
autofailover is triggered and *this commit is not replicated to the replica*, 
the commit is effectively aborted. Therefore the client has to wait some time 
(exceeding the duration of autofailover) and then check (logically, based on 
the committed data) the status of the commit.
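
A minimal sketch of that post-failover check, using the table and key from Andrey's example (purely illustrative):

-- after waiting out the autofailover window, the client re-reads its own key:
-- if the row is present, the locally committed transaction survived the
-- failover; if it is absent, the commit was effectively lost
SELECT pk, v, dt
FROM t
WHERE pk = 5;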


The problem here is that the locally committed data becomes visible to future 
transactions (before autofailover), which violates the property of consistent 
reads from the master. IMO the more correct behavior for PostgreSQL here would 
be to ignore any cancel / termination requests while the backend is waiting for 
a response from sync replicas.


However, there is another way to get locally applied commits: via a restart of 
the master after initial recovery. This case is described in the docs 
https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA 
. But here the HA orchestrator agent can close access from external users (via 
pg_hba.conf manipulations) until the PostgreSQL instance has synchronized its 
changes with all sync replicas, as is implemented in Stolon 
https://github.com/sorintlab/stolon/blob/master/doc/syncrepl.md#handling-postgresql-sync-repl-limits-under-such-circumstances 
.



Best regards,
Maksim Milyutin





Re: Commit to primary with unavailable sync standby

2019-12-23 Thread Maksim Milyutin



On 19.12.2019 18:08, Fabio Ugo Venchiarutti wrote:



On 19/12/2019 13:58, Maksim Milyutin wrote:

On 19.12.2019 14:04, Andrey Borodin wrote:


Hi!



Hi!

FYI, this topic was up recently in -hackers 
https://www.postgresql.org/message-id/caeet0zhg5off7iecby6tzadh1moslmfz1hlm311p9vot7z+...@mail.gmail.com 




I cannot figure out a proper way to implement safe HA upsert. I will be very 
grateful if someone could help me.


Imagine we have a primary server after failover. It is network-partitioned. We 
are doing an INSERT ON CONFLICT DO NOTHING that eventually timed out.


az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
 INSERT INTO t(
 pk,
 v,
 dt
 )
 VALUES
 (
 5,
 'text',
 now()
 )
 ON CONFLICT (pk) DO NOTHING
 RETURNING pk,
   v,
   dt)
    SELECT new_doc.pk from new_doc;
^CCancel request sent
WARNING:  01000: canceling wait for synchronous replication due to 
user request
DETAIL:  The transaction has already committed locally, but might 
not have been replicated to the standby.

LOCATION:  SyncRepWaitForLSN, syncrep.c:264
Time: 2173.770 ms (00:02.174)

Here our driver decided that something went wrong, and we retried the query.

az1-grx88oegoy6mrv2i/db1 M > WITH new_doc AS (
 INSERT INTO t(
 pk,
 v,
 dt
 )
 VALUES
 (
 5,
 'text',
 now()
 )
 ON CONFLICT (pk) DO NOTHING
 RETURNING pk,
   v,
   dt)
    SELECT new_doc.pk from new_doc;
  pk

(0 rows)

Time: 4.785 ms

Now we have split-brain, because we acknowledged that row to the client.
How can I fix this?

There must be some obvious trick, but I cannot see it... Or maybe cancelling 
the wait for sync replication should be disallowed and termination should be 
treated as a system failure?




I think the most appropriate way to handle such issues is for the client 
driver to catch such warnings (with the message about the local commit) and 
mark the status of the posted transaction as undetermined. If the connection 
with the sync replica comes back, this transaction eventually commits; but 
after autofailover is triggered and *this commit is not replicated to the 
replica*, the commit is effectively aborted. Therefore the client has to wait 
some time (exceeding the duration of autofailover) and then check (logically, 
based on the committed data) the status of the commit.


The problem here is that the locally committed data becomes visible to future 
transactions (before autofailover), which violates the property of consistent 
reads from the master. IMO the more correct behavior for PostgreSQL here would 
be to ignore any cancel / termination requests while the backend is waiting 
for a response from sync replicas.


However, there is another way to get locally applied commits: via a restart of 
the master after initial recovery. This case is described in the docs 
https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA 
. But here the HA orchestrator agent can close access from external users 
(via pg_hba.conf manipulations) until the PostgreSQL instance synchronizes



And this is where the unsafety lies: that assumes that the isolated master is 
in enough of a sane state to apply a self-ban (and that it can do it in 
near-zero time).



Although the retry logic in Andrey's case is probably not ideal (and you 
offered a more correct approach to synchronous commit), there are many "grey 
area" failure modes that in his scenario would prevent a given node from 
sealing up fast enough, if at all (eg: PID congestion causing fork()/system() 
to fail while backends are already up and happily flushing WAL).



This is particularly relevant to situations where only a subset of critical 
transactions set synchronous_commit to remote_*: it'd still be undesirable to 
sink "tier 2" data in a stale primary for any significant length of time.



Could you describe your thesis more concretely? In my proposal the self-ban on 
the master is applied after restarting it, so that changes from locally 
committed transactions are not visible to new incoming transactions.



In the case of postgres (or any RDBMS, really), all I can think of is 
either an inline proxy performing some validation as part of the 
forwarding (which is what we did internally but that has not been 
green lit for FOSS :( )



External validation unfortunately is not an option here. As I mentioned before, 
the local commits become visible to future transactions coming to the master, 
and even if some proxy reports to the client that the transaction is not 
completely committed, new incoming transactions that read the locally applied 
changes and make their own changes based on them implicitly confirm the status 
of those changes as committed.



or some logic in the backend that rejects asynchronous commits too if 
some condition is not met (eg:  synchronous standby nodes 
not present - a builtin version of the pg_stat_replication look-aside 
CTE I suggested earl

Re: High Availability, guarantee to use sync nodes

2020-09-01 Thread Maksim Milyutin

On 31.08.2020 14:06, Dirk Krautschick wrote:


Hi all,



Hi

if there are several PostgreSQL nodes replicating synchronously and I configure 
a guarantee that, for example, 3 of 5 nodes have to be in sync to go on with 
synchronous_standby_names, and now I want to implement a load-balancing 
scenario with active standby DBs… how or with what tool can I achieve that my 
read clients only use sync nodes? For sure some check in pg_stat_replication 
beforehand is possible, but is there something in parameters or with solutions 
like repmgr, patroni, etc. to do this?



For tools such as patroni or stolon, the list of sync replicas can be obtained 
from the DCS storage (etcd, consul, etc.). Your application or some proxy, e.g. 
haproxy with confd, can extract the list of sync nodes, subscribe to changes in 
the KV entry that stores this list, and route queries according to it.
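
If you prefer the direct check against the primary that Dirk mentioned, it is essentially this (a sketch; the application_name values depend on how your standbys connect):

-- run on the primary: lists the standbys currently counted as synchronous
SELECT application_name, client_addr, sync_state
FROM pg_stat_replication
WHERE sync_state IN ('sync', 'quorum');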



--
Regards,
Maksim Milyutin