Database Stalls
Hi, We've started to observe instances of one of our databases stalling for a few seconds. We see a spike in wal write locks then nothing for a few seconds. After which we have spike latency as processes waiting to get to the db can do so. There is nothing in the postgres logs that give us any clues to what could be happening, no locks, unusually high/long running transactions, just a pause and resume. Could anyone give me any advice as to what to look for when it comes to checking the underlying disk that the db is on? Thanks, Gurmokh
Re: Database Stalls
On Mon, Jan 30, 2023 at 05:47:49PM +, Mok wrote: > Hi, > > We've started to observe instances of one of our databases stalling for a > few seconds. > > We see a spike in wal write locks then nothing for a few seconds. After > which we have spike latency as processes waiting to get to the db can do > so. > > There is nothing in the postgres logs that give us any clues to what could > be happening, no locks, unusually high/long running transactions, just a > pause and resume. > > Could anyone give me any advice as to what to look for when it comes to > checking the underlying disk that the db is on? What version postgres? What settings have non-default values ? What OS/version? What environment/hardware? VM/image/provider/... Have you enabled logging for vacuum/checkpoints/locks ? https://wiki.postgresql.org/wiki/Slow_Query_Questions
Re: Database Stalls
On Mon, Jan 30, 2023 at 2:51 PM Justin Pryzby wrote: > On Mon, Jan 30, 2023 at 05:47:49PM +, Mok wrote: > > Hi, > > > > We've started to observe instances of one of our databases stalling for a > > few seconds. > > > > We see a spike in wal write locks then nothing for a few seconds. After > > which we have spike latency as processes waiting to get to the db can do > > so. > > > > There is nothing in the postgres logs that give us any clues to what > could > > be happening, no locks, unusually high/long running transactions, just a > > pause and resume. > > > > Could anyone give me any advice as to what to look for when it comes to > > checking the underlying disk that the db is on? > > What version postgres? What settings have non-default values ? > > > What OS/version? What environment/hardware? VM/image/provider/... > > > > Have you enabled logging for vacuum/checkpoints/locks ? > > https://wiki.postgresql.org/wiki/Slow_Query_Questions > > > In addition to previous questions, if possible, a SELECT * FROM pg_stat_activity at the moment of the stall. The most important information is the wait_event column. My guess is the disk, but just the select at the right moment can answer this. -- José Arthur Benetasso Villanova
Re: Database Stalls
Hi Burmokh, Please take a look at this article copied below and ping me for further guidance. Thanks! How expensive SQLs can impact PostgreSQL Performance? - https://minervadb.xyz/how-expensive-sqls-can-impact-postgresql-performance/ — Best Shiv > On 30-Jan-2023, at 11:17 PM, Mok wrote: > > Hi, > > We've started to observe instances of one of our databases stalling for a few > seconds. > > We see a spike in wal write locks then nothing for a few seconds. After which > we have spike latency as processes waiting to get to the db can do so. > > There is nothing in the postgres logs that give us any clues to what could be > happening, no locks, unusually high/long running transactions, just a pause > and resume. > > Could anyone give me any advice as to what to look for when it comes to > checking the underlying disk that the db is on? > > Thanks, > > Gurmokh > >
Re: Database Stalls
Hi, Unfortunately there is no pg_stat_activity data available as we are unaware of the issue until it has already happened. The version we are on is 12.11. I don't think it is due to locks as there are none in the logs. Vacuums are logged also and none occur before or after this event. Checkpoint timeout is set to 1 hour and these events do not coincide with checkpoints. Gurmokh On Mon, 30 Jan 2023 at 18:47, Shiv Iyer wrote: > Hi Burmokh, > > Please take a look at this article copied below and ping me for further > guidance. Thanks! > > > How expensive SQLs can impact PostgreSQL Performance? - > https://minervadb.xyz/how-expensive-sqls-can-impact-postgresql-performance/ > > > — > Best > Shiv > > > > On 30-Jan-2023, at 11:17 PM, Mok wrote: > > Hi, > > We've started to observe instances of one of our databases stalling for a > few seconds. > > We see a spike in wal write locks then nothing for a few seconds. After > which we have spike latency as processes waiting to get to the db can do > so. > > There is nothing in the postgres logs that give us any clues to what could > be happening, no locks, unusually high/long running transactions, just a > pause and resume. > > Could anyone give me any advice as to what to look for when it comes to > checking the underlying disk that the db is on? > > Thanks, > > Gurmokh > > > >
Fwd: Database Stalls
On Mon, Jan 30, 2023 at 4:32 PM Mok wrote: > Hi, > > Unfortunately there is no pg_stat_activity data available as we are > unaware of the issue until it has already happened. > > The version we are on is 12.11. > > I don't think it is due to locks as there are none in the logs. Vacuums > are logged also and none occur before or after this event. Checkpoint > timeout is set to 1 hour and these events do not coincide with checkpoints. > > Gurmokh > >> >> Have you eliminated network issues? I have seen what looks like a database stalling to end up actually being the network packets taking a side trip to halfway around the world for a while. Or DNS lookups suddenly taking a really long time. The next most likely thing is disk i/o. Do you have huge corresponding disk i/o spikes or does it drop completely to zero (which is also bad - especially if you are on a SAN and you can't get any packets out on that network). You'll have to look at your disks via OS tools to see. Do you have any hardware faults? Errors on a hardware bus? Overheating? I used to have a system that would freeze up entirely due to a problem with a serial port that we had a console attached to - it was sending a low level interrupt. Sometimes it would recover mysteriously if someone hit the carriage return a couple times. Ie, is it _really_ the database that is locking up, or is it your hardware?
