Replication lag due to lagging restart_lsn
Hello,

I wish to use logical replication in Postgres to capture transactions as CDC and forward them to a custom sink. To understand the overhead of the logical replication workflow I created a toy subscriber using the V3PGReplicationStream that acknowledges LSNs after every 16k reads by calling setAppliedLSN, setFlushedLSN, and forceUpdateStatus. The toy subscriber is set up as a subscriber to a master Postgres instance that publishes changes through a Publication. I then run a write-heavy workload on this setup that generates transaction logs at approximately 235 MB/s. Postgres runs on a beefy machine with a 10+ GBps network link between Postgres and the toy subscriber.

My expectation with this setup was that the replication lag on the master would be minimal, since the subscriber acks the LSN almost immediately. However, I observe the replication lag increasing continuously for the duration of the test. Statistics in pg_replication_slots show that restart_lsn lags significantly behind confirmed_flush_lsn. Cursory reading on restart_lsn suggests that a growing gap between restart_lsn and confirmed_flush_lsn means the server is still retaining WAL it may need for the slot: that disk space can only be reclaimed once restart_lsn advances to catch up with confirmed_flush_lsn.

With that context, I am looking for answers to two questions:
1. What work needs to happen in the database to advance restart_lsn to confirmed_flush_lsn?
2. What is the recommendation on tuning the database to improve the replication lag in such scenarios?

Regards,
Satyam
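For reference, the gap described above can be watched from the primary with a query along these lines (nothing beyond an existing logical slot is assumed):

    SELECT slot_name,
           restart_lsn,
           confirmed_flush_lsn,
           pg_size_pretty(pg_wal_lsn_diff(confirmed_flush_lsn, restart_lsn)) AS restart_gap,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS unconfirmed_wal
    FROM pg_replication_slots
    WHERE slot_type = 'logical';

A steadily growing restart_gap while unconfirmed_wal stays small matches the situation described: the subscriber confirms quickly, but the slot's restart point is not moving forward.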
CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Using PostgreSQL v12, Linux [Ubuntu 16.04 LTS].

I have a system which implements a message queue with the basic pattern that a process selects a group of rows, for example 250, for processing via SELECT .. LIMIT 250 FOR UPDATE SKIP LOCKED.

When there is a small number of concurrent connections processing the queue, this works as expected and connections quickly obtain a unique block of 250 rows for processing. However, as I scale up the number of concurrent connections, I see a spike in CPU (to 100% across 80 cores) when the SELECT FOR UPDATE SKIP LOCKED executes, and the select processes wait for multiple minutes (10-20 minutes) before completing. My use case requires around 256 concurrent processors for the queue, but I've been unable to scale beyond 128 without everything grinding to a halt.

The queue table itself fits in RAM (with 2M hugepages), and during the wait all the performance counters drop to almost 0 - no disk read or write (semi-expected since the table fits in memory), 100% buffer hit rate in pg_top, and rows read around 100/s, which is much lower than expected.

After processes complete the select and the number of waiting selects starts to fall, CPU load falls, then suddenly the remaining processes all complete within a few seconds and things perform normally until the next time a group of SELECT FOR UPDATE statements bunch together, at which point the cycle repeats.

I found that performing extremely frequent vacuum analyze (every 30 minutes) helps a little, but not enough; the problem is still very apparent. I've exhausted all the performance tuning and analysis suggestions I can find that seem even slightly relevant but cannot get this cracked.

Is anyone on the list able to help with suggestions of what I can do to track down why this CPU hogging happens, as it does seem to be the root of the problem?

Thanks in advance,
Jim
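For context, the consumer side of the pattern described above looks roughly like this (table and column names are placeholders, not the actual schema from this thread):

    BEGIN;
    SELECT q_id, payload                 -- placeholder columns
    FROM   queue                         -- placeholder table name
    ORDER  BY q_id
    LIMIT  250
    FOR UPDATE SKIP LOCKED;
    -- ... application processes the claimed rows and deletes them ...
    COMMIT;

SKIP LOCKED makes each worker step over rows already row-locked by another transaction, so concurrent workers should each end up with a disjoint batch.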
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Message queue...

Are rows deleted? Are they updated once or many times? Have you adjusted fillfactor on the table or indexes? How many rows are in the table currently, or on average? Is there any ordering to which rows you update?

It seems likely that one of the experts/code contributors will chime in and explain how locking that many rows across that many concurrent connections overruns some resource, so that you escalate to a table lock instead of actually locking only the 250 rows you wanted.

On the other hand, you say 80 cores and you are trying to push the number of concurrent processes well beyond that without (much) disk I/O being involved. I wouldn't expect that to perform well.

Is there a chance to modify the code to let each process lock 1000 rows at a time and be content with 64 concurrent processes?
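If fillfactor turns out to be worth testing, the change itself is small; for example (object names here are hypothetical):

    ALTER TABLE queue SET (fillfactor = 70);          -- leaves free space on heap pages
    ALTER INDEX queue_q_id_idx SET (fillfactor = 70);
    -- applies to newly written pages; VACUUM FULL / REINDEX rewrites existing data with the new setting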
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Thank you for the quick response.

No adjustments of fill factors. Hadn't thought of that - I'll investigate and try some options to see if I can measure an effect.

There is some ordering on the select [ORDER BY q_id] so each block of 250 is sequential-ish queue items; I just need them more or less in the order they were queued, so as near FIFO as possible without being totally strict on absolute sequential order.

The table has around 192K rows. As a row is processed it is deleted as part of the transaction, with a commit at the end after all 250 are processed [partitioned table; the state changes and the row migrates to a different partition], and as the queue drops to 64K rows it is topped up with 128K rows at a time.

I've tuned the LIMIT value both up and down. As I move the limit up, the problem becomes substantially worse: 300 swamps it and the selects take > 1 hour to complete; at 600 they all lock everything up and processing stops. I did try 1,000 but it basically resulted in nothing being processed.

Fewer processes do not give the throughput required, because the queue sends data elsewhere which has a long round-trip time but does permit over 1K concurrent connections as their workaround for throughput. I'm stuck having to scale up my concurrent processes to compensate for the long processing time of an individual queue item.
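A purely illustrative sketch of the layout described above (all object names invented, details certainly differ):

    CREATE TABLE queue (
        q_id    bigint NOT NULL,
        state   text   NOT NULL,
        payload jsonb
    ) PARTITION BY LIST (state);

    CREATE TABLE queue_pending PARTITION OF queue FOR VALUES IN ('pending');
    CREATE TABLE queue_done    PARTITION OF queue FOR VALUES IN ('done');
    -- an UPDATE that changes state moves the row to the other partition (supported since PG 11)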
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Did you try using NOWAIT instead of SKIP LOCKED to see if the behavior still shows up?
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
On Tue, Aug 18, 2020 at 6:22 PM Jim Jarvie wrote:
> There is some ordering on the select [ORDER BY q_id] so each block of 250 is sequential-ish queue items; I just need them more or less in the order they were queued so as near FIFO as possible without being totally strict on absolute sequential order.

How long does each process take in total? How strict does that FIFO really need to be when you are already doing SKIP LOCKED anyway?

> Table has around 192K rows, as a row is processed it is deleted as part of the transaction with a commit at the end after all 250 are processed [partitioned table, state changes and it migrates to a different partition] and as the queue drops to 64K it is added to with 128K rows at a time.

Can you expound on the partitioning? Are all consumers of the queue always hitting one active partition, and any time a row is processed, does it always move to one of many "archived"-type partitions?

> Less processes does not give the throughput required because the queue sends data elsewhere which has a long round trip time

Is that done via FDW or otherwise within the same database transaction? Are you connecting some queue consumer application code to Postgres, doing the SELECT FOR UPDATE, doing work on some remote system that is slow, and then coming back and committing the DB work?

By the way, top-posting is discouraged here and partial quotes with interspersed comments are common practice.
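One variant that sometimes comes up for the slow-remote-work case (not something suggested in this thread, purely a sketch with invented names): claim the rows in a short transaction so no locks or open transaction are held during the long round trip.

    BEGIN;
    UPDATE queue SET claimed_by = 'worker-17'          -- hypothetical claim column / worker id
    WHERE  q_id IN (
        SELECT q_id
        FROM   queue
        WHERE  claimed_by IS NULL
        ORDER  BY q_id
        LIMIT  250
        FOR UPDATE SKIP LOCKED
    )
    RETURNING q_id;
    COMMIT;
    -- do the slow remote work with no transaction open,
    -- then DELETE ... WHERE claimed_by = 'worker-17' in a second short transaction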
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Also, have you checked how bloated your indexes are getting? Do you run default autovacuum settings? Did you pick up the new default autovacuum_vacuum_cost_delay of 2ms? With a destructive queue, it would be very important to ensure autovacuum is keeping up with the churn. Share your basic table structure and indexes, sanitized if need be.
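To make autovacuum keep up with a destructive queue, per-table settings are the usual lever; for example (names and values purely illustrative, and they need to go on the leaf partitions, since autovacuum processes those rather than the partitioned parent):

    ALTER TABLE queue_pending SET (
        autovacuum_vacuum_scale_factor = 0.0,
        autovacuum_vacuum_threshold    = 5000,
        autovacuum_vacuum_cost_delay   = 0
    );

With scale_factor at 0, vacuum triggers after a fixed number of dead rows (the threshold) rather than a percentage of the table.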
Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
On Tue, 2020-08-18 at 19:52 -0400, Jim Jarvie wrote:
> Is anyone on the list able to help with suggestions of what I can do to track
> why this CPU hogging happens as this does seem to be the root of the problem?

You should
- check with "pgstattuple" if the table is bloated.
- use "perf" to see where the CPU time is spent.
- look at "pg_stat_activity" for wait events (unlikely if the CPU is busy).

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
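For reference, the first and third of those checks might look like the following (the table name is a stand-in):

    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT * FROM pgstattuple('queue_pending');   -- dead_tuple_percent / free_percent indicate bloat

    SELECT pid, state, wait_event_type, wait_event, query
    FROM   pg_stat_activity
    WHERE  state <> 'idle';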
