Occasional lengthy locking causing stalling on commit

2021-01-30 Thread Ben Hoskings
G'day all,

We have a postgres 11.9 instance serving 7-8k queries/sec. In general
it's humming along with no issues (p50 latency of ~1ms; p95 ~30ms).
Lately though, we've seen occasional locking that causes all commits
to the main database to be briefly blocked and the APIs that it backs
to stall.

This is what I know so far:
  * It's happened at varying times of day over the last 3 weeks:
around 10 significant cases (20-30 second stalls) plus a number of
shorter ones.
  * Each time, queries are blocked behind an AccessExclusiveLock on
the database (class 1262) that is itself blocked.
  * There's no change in DB throughput at the time.
  * No schema changes or any other queries are being made that would
involve an AccessExclusiveLock.
  * The blocked queries are all commit statements; our metrics show
dozens of traces each time that look perfectly normal except for a 20
second commit somewhere within them.
  * We make heavy use of listen/notify via que with ~80 notify events
per second across 16 listen channels. (https://github.com/que-rb/que)
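
For reference, que's wake-up mechanism is plain LISTEN/NOTIFY; the
channel name and payload below are illustrative only, and each NOTIFY
is queued inside its transaction and only delivered at commit:

  -- Worker session: subscribe to a channel.
  LISTEN que_channel_1;

  -- Enqueueing session: the notification is delivered when this commits.
  NOTIFY que_channel_1, '{"job_id": 42}';
  -- Equivalent function form, convenient from application code:
  SELECT pg_notify('que_channel_1', '{"job_id": 42}');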

Based on all that and the logs below, the locks in question appear to
be held while inserting NOTIFY messages on commit:
https://www.postgresql.org/message-id/3598.1363354686%40sss.pgh.pa.us

If that's the case, the real question is why a small number of those
commits are taking so much longer than expected. Any advice on how to
proceed from here would be much appreciated.
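
In case it helps frame suggestions, something like this (a sketch only)
should show who holds and who waits on the lock that the logs report as
"object 0 of class 1262 of database 0" while a stall is in progress:

  SELECT l.pid, l.granted, a.state, a.wait_event_type, a.wait_event,
         now() - a.xact_start AS xact_age,
         left(a.query, 60) AS query
    FROM pg_locks l
    JOIN pg_stat_activity a USING (pid)
   WHERE l.locktype = 'object'
     AND l.database = 0
     AND l.classid = 1262
     AND l.objid = 0
   ORDER BY l.granted DESC, xact_age DESC;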

Cheers
Ben Hoskings
---

The observed outage here started at ~00:35:22 and ended when the
waiting processes finally acquired their locks.

2021-01-27 00:35:15.264 UTC [1554652]: [21-1] db=main,user=runtime
LOG: process 1554652 still waiting for ShareLock on transaction
3399854732 after 1000.091 ms
2021-01-27 00:35:15.264 UTC [1554652]: [22-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.264 UTC [1554652]: [23-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:15.264 UTC [1554652]: [24-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.203471', "updated_at" = '2021-01-27
00:35:14.204966' WHERE "devices"."id" = #
2021-01-27 00:35:15.265 UTC [1542154]: [82-1] db=main,user=runtime
LOG: process 1542154 still waiting for ShareLock on transaction
3399854732 after 1000.098 ms
2021-01-27 00:35:15.265 UTC [1542154]: [83-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.265 UTC [1542154]: [84-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:15.265 UTC [1542154]: [85-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.226635', "updated_at" = '2021-01-27
00:35:14.227932' WHERE "devices"."id" = #
2021-01-27 00:35:15.267 UTC [1542521]: [93-1] db=main,user=runtime
LOG: process 1542521 still waiting for ShareLock on transaction
3399854728 after 1000.067 ms
2021-01-27 00:35:15.267 UTC [1542521]: [94-1] db=main,user=runtime
DETAIL: Process holding the lock: 1547170. Wait queue: 1542514,
1542521.
2021-01-27 00:35:15.267 UTC [1542521]: [95-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:15.267 UTC [1542521]: [96-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.257563', "updated_at" = '2021-01-27
00:35:14.258934' WHERE "devices"."id" = #
2021-01-27 00:35:15.278 UTC [1479762]: [60-1] db=main,user=runtime
LOG: process 1479762 still waiting for AccessExclusiveLock on object 0
of class 1262 of database 0 after 1000.129 ms
2021-01-27 00:35:15.278 UTC [1479762]: [61-1] db=main,user=runtime
DETAIL: Process holding the lock: 1518905. Wait queue: 1517976,
1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947,
1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039,
1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037,
1473038, 1547149, 1518909, 1479762, 1479765.
2021-01-27 00:35:15.290 UTC [1479765]: [51-1] db=main,user=runtime
LOG: process 1479765 still waiting for AccessExclusiveLock on object 0
of class 1262 of database 0 after 1000.110 ms
2021-01-27 00:35:15.290 UTC [1479765]: [52-1] db=main,user=runtime
DETAIL: Process holding the lock: 1518905. Wait queue: 1517976,
1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947,
1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039,
1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037,
1473038, 1547149, 1518909, 1479762, 1479765.
2021-01-27 00:35:15.290 UTC [1479765]: [53-1] db=main,user=runtime
STATEMENT: COMMIT

Re: Occasional lengthy locking causing stalling on commit

2021-01-31 Thread Ben Hoskings
On Sun, 31 Jan 2021 at 03:50, Tom Lane  wrote:
>
> Possibly you'd benefit from updating to v13, which has the listen/notify
> performance improvements Martijn referred to in the other thread.
>
> It's also possible that the hangup is unrelated to that, being somewhere
> later in commit processing than where the notify traffic gets dumped out.
> If you could identify which process has the "database 0" lock during one
> of these delays, and capture a stack trace from it, that would be
> super-informative.  Not sure about a good way to do that though.
> Maybe you could automate tailing the log for "DETAIL: Process holding the
> lock: nnn." and gdb'ing that process.
>
> regards, tom lane

Thanks for the reply Tom. We're running on Google Cloud SQL so we
won't be able to use gdb - unless we can convince Google to run it for
us :)
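
One gdb-less option might be to poll pg_stat_activity for the PID named
in the "Process holding the lock" DETAIL line while a stall is underway -
it won't give a stack trace, but it at least captures the holder's
wait_event. A sketch, using the holder PID from the log excerpt in my
first mail purely for illustration:

  SELECT pid, state, wait_event_type, wait_event,
         now() - xact_start AS xact_age, query
    FROM pg_stat_activity
   WHERE pid = 1518905;  -- holder PID from the earlier log, for illustration

  -- Or list every blocked backend together with the PIDs blocking it:
  SELECT pid, pg_blocking_pids(pid) AS blocked_by,
         wait_event_type, wait_event
    FROM pg_stat_activity
   WHERE cardinality(pg_blocking_pids(pid)) > 0;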

I wonder if there are any likely candidates that we could look into -
for example, is it possible it could be due to a batched index update
that we could alleviate with "fastupdate=off"?
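
(For reference, fastupdate only applies to GIN indexes; if one of ours
did turn out to have a large pending list, the rough plan - the index
name below is just a placeholder - would be something like:)

  -- List GIN indexes and their storage options.
  SELECT c.relname AS index_name, c.reloptions
    FROM pg_class c
    JOIN pg_am am ON am.oid = c.relam
   WHERE c.relkind = 'i' AND am.amname = 'gin';

  -- Placeholder index name: disable the pending list, then flush it.
  ALTER INDEX some_gin_index SET (fastupdate = off);
  SELECT gin_clean_pending_list('some_gin_index');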

Cheers
Ben




Re: Occasional lengthy locking causing stalling on commit

2021-01-31 Thread Ben Hoskings
On Mon, 1 Feb 2021 at 10:33, Tom Lane  wrote:
>
> One thing that just occurred to me is that you might find it
> interesting to keep tabs on what's in the $PGDATA/pg_notify
> directory.  Do the performance burps correspond to transitory
> peaks in the amount of data there?  Or (grasping at straws here...)
> wraparound of the file names back to ?

We don't have filesystem access on Cloud SQL - the downside of the
managed route :)
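
One thing we can do without filesystem access is sample
pg_notification_queue_usage(), which reports the fraction of the
notification queue currently in use - a rough stand-in for watching
pg_notify. A minimal sketch:

  -- Fraction of the async notification queue occupied, from 0.0 to 1.0.
  SELECT pg_notification_queue_usage();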

It sounds like it might be time to bump the pg13 upgrade up the TODO list.

Cheers
Ben




Re: Occasional lengthy locking causing stalling on commit

2021-05-18 Thread Ben Hoskings
G'day all,

A quick follow-up on this issue for interest's sake. The stalling we
were seeing turned out to be a Cloud SQL issue and not related to our
listen/notify usage.

Cloud SQL has an automatic storage increase process that resizes the
underlying disk as required to account for cluster growth. As it turns
out, that process occasionally causes I/O to stall for a brief window.
https://cloud.google.com/sql/docs/postgres/instance-settings#automatic-storage-increase-2ndgen

The workaround supplied by Google is to manually provision slack
storage in larger increments to prevent the more frequent automatic
increases, which happen 25GB at a time on a large cluster.

We didn't make the connection because disk resize events are not
visible in any logs; Google Support found the issue by correlating the
timestamps of our observed outages with their internal logs.

Hopefully this is useful for someone else. Thanks again for your help,
Tom - your advice on listen/notify locking on commit was very useful
even though it turned out not to be the cause in this case.

Cheers
Ben
