G'day all,
We have a postgres 11.9 instance serving 7-8k queries/sec. In general
it's humming along with no issues (p50 patency of ~1ms; p95 ~30ms).
Lately though, we've seen occasional locking that causes all commits
to the main database to be briefly blocked and the APIs that it backs
to stall.
This is what I know so far:
* It's happened at varying times of day over the last 3 weeks.
Around 10 significant cases (20-30 second stalls) and more shorter
ones.
* Each time queries are blocked behind an AccessExclusiveLock on the
database (class 1262) that is itself blocked.
* There's no change in DB throughput at the time.
* No schema changes or any other queries are being made that would
involve an AccessExclusiveLock.
* The blocked queries are all commit statements; our metrics show
dozens of traces each time that look perfectly normal except for a 20
second commit somewhere within them.
* We make heavy use of listen/notify via que with ~80 notify events
per second across 16 listen channels. (https://github.com/que-rb/que)
Based on all that and the logs below, the locks in question appear to
be held while inserting NOTIFY messages on commit:
https://www.postgresql.org/message-id/3598.1363354686%40sss.pgh.pa.us
If that's the case, the real question is why a small number of those
commits are taking so much longer than expected. Any advice on how to
proceed from here much appreciated.
Cheers
Ben Hoskings
---
The observed outage here started at ~00:35:22 and ended when the
blocked locks were acquired.
2021-01-27 00:35:15.264 UTC [1554652]: [21-1] db=main,user=runtime
LOG: process 1554652 still waiting for ShareLock on transaction
3399854732 after 1000.091 ms
2021-01-27 00:35:15.264 UTC [1554652]: [22-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.264 UTC [1554652]: [23-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:15.264 UTC [1554652]: [24-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.203471', "updated_at" = '2021-01-27
00:35:14.204966' WHERE "devices"."id" = #
2021-01-27 00:35:15.265 UTC [1542154]: [82-1] db=main,user=runtime
LOG: process 1542154 still waiting for ShareLock on transaction
3399854732 after 1000.098 ms
2021-01-27 00:35:15.265 UTC [1542154]: [83-1] db=main,user=runtime
DETAIL: Process holding the lock: 1542461. Wait queue: 1554652,
1542154, 1542496.
2021-01-27 00:35:15.265 UTC [1542154]: [84-1] db=main,user=runtime
CONTEXT: while rechecking updated tuple (#,#) in relation "devices"
2021-01-27 00:35:15.265 UTC [1542154]: [85-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.226635', "updated_at" = '2021-01-27
00:35:14.227932' WHERE "devices"."id" = #
2021-01-27 00:35:15.267 UTC [1542521]: [93-1] db=main,user=runtime
LOG: process 1542521 still waiting for ShareLock on transaction
3399854728 after 1000.067 ms
2021-01-27 00:35:15.267 UTC [1542521]: [94-1] db=main,user=runtime
DETAIL: Process holding the lock: 1547170. Wait queue: 1542514,
1542521.
2021-01-27 00:35:15.267 UTC [1542521]: [95-1] db=main,user=runtime
CONTEXT: while updating tuple (#,#) in relation "devices"
2021-01-27 00:35:15.267 UTC [1542521]: [96-1] db=main,user=runtime
STATEMENT: UPDATE "devices" SET "last_successful_request_at" =
'2021-01-27 00:35:14.257563', "updated_at" = '2021-01-27
00:35:14.258934' WHERE "devices"."id" = #
2021-01-27 00:35:15.278 UTC [1479762]: [60-1] db=main,user=runtime
LOG: process 1479762 still waiting for AccessExclusiveLock on object 0
of class 1262 of database 0 after 1000.129 ms
2021-01-27 00:35:15.278 UTC [1479762]: [61-1] db=main,user=runtime
DETAIL: Process holding the lock: 1518905. Wait queue: 1517976,
1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947,
1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039,
1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037,
1473038, 1547149, 1518909, 1479762, 1479765.
2021-01-27 00:35:15.290 UTC [1479765]: [51-1] db=main,user=runtime
LOG: process 1479765 still waiting for AccessExclusiveLock on object 0
of class 1262 of database 0 after 1000.110 ms
2021-01-27 00:35:15.290 UTC [1479765]: [52-1] db=main,user=runtime
DETAIL: Process holding the lock: 1518905. Wait queue: 1517976,
1518853, 1445675, 1518975, 1477989, 1518976, 1517975, 3018947,
1468183, 1477990, 1532581, 1474376, 1558501, 1477991, 1473039,
1468182, 1558509, 1477988, 1558516, 3019502, 1518852, 1473037,
1473038, 1547149, 1518909, 1479762, 1479765.
2021-01-27 00:35:15.290 UTC [1479765]: [53-1] db=main,user=runtime
STATEMENT: COMMIT