many backends hang on MultiXactOffsetSLRU

2024-09-10 Thread James Pang
Hi experts,
we have a PostgreSQL v14.8 database; thousands of backends hung on
MultiXactOffsetSLRU at the same time. All of these sessions were running the
same query ("SELECT "), and from the OS and the PostgreSQL slow log we found
that all of them were stuck in the "BIND" stage.
 LOG:  duration: 36631.688 ms  bind S_813: SELECT
LOG:  duration: 36859.786 ms  bind S_: SELECT
LOG:  duration: 35868.148 ms  bind : SELECT
LOG:  duration: 36906.471 ms  bind : SELECT
LOG:  duration: 35955.489 ms  bind : SELECT
LOG:  duration: 36833.510 ms  bind : SELECT
LOG:  duration: 36839.535 ms  bind S_1219: SELECT
...

This database hung on MultiXactOffsetSLRU and MultiXactOffsetBuffer for a
long time.

Could you tell me why they are hanging in the 'BIND' stage with
MultiXactOffsetSLRU?

Thanks,

James
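For reference, the pile-up James describes can be confirmed with a standard
pg_stat_activity query (not part of the original report; a sketch only):

```sql
-- Count active backends per wait event; during an incident like this,
-- a large LWLock / MultiXactOffsetSLRU group would dominate the output.
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
```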


Re: many backends hang on MultiXactOffsetSLRU

2024-09-10 Thread Amine Tengilimoglu
 Hi,

   I encountered this before, in a project we migrated to PostgreSQL,
and unfortunately it's a situation that completely degrades performance. We
identified the cause as savepoints being used excessively and without
control. Once the team reduced the number of savepoints, the issue was
resolved. However, the documentation also mentions that it can be caused by
foreign keys.
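As an illustration of how savepoints pile up (a hedged sketch, not from the
original thread): each PL/pgSQL BEGIN/EXCEPTION block opens an implicit
subtransaction, just like an explicit SAVEPOINT, so a loop like the one below
burns one subtransaction per iteration. Once a backend accumulates more
subtransactions than fit in its in-memory subxid cache, visibility checks by
other sessions fall back to the pg_subtrans SLRU, which is where the
contention appears:

```sql
DO $$
BEGIN
  FOR i IN 1..100 LOOP
    BEGIN              -- implicit subtransaction, equivalent to a SAVEPOINT
      PERFORM 1;       -- stand-in for the real per-statement work
    EXCEPTION WHEN OTHERS THEN
      NULL;            -- swallow errors; the subtransaction is still consumed
    END;
  END LOOP;
END $$;
```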


  Kind regards..



Re: many backends hang on MultiXactOffsetSLRU

2024-09-10 Thread Alvaro Herrera
On 2024-Sep-10, James Pang wrote:

> could you direct me why they are hanging in the 'BIND' stage with
> MultiXactOffsetSLRU?

Very likely it's related to this problem:
[1] https://thebuild.com/blog/2023/01/18/a-foreign-key-pathology-to-avoid/

This is caused by a suboptimal implementation of what we call SLRU,
which multixact uses underneath.  For years, many people dodged this
problem by recompiling with a larger value of
NUM_MULTIXACTOFFSET_BUFFERS in src/include/access/multixact.h (it was
originally 8 buffers, which is very small); you'll need to do that in
all releases up to pg16.  In pg17 this was improved [2][3] and you can
change the value in postgresql.conf; since the default is already
larger than the original (16 instead of 8), you may not need to.

[2] 
https://pgconf.in/files/presentations/2023/Dilip_Kumar_RareExtremelyChallengingPostgresPerformanceProblems.pdf
[3] 
https://www.pgevents.ca/events/pgconfdev2024/schedule/session/53-problem-in-postgresql-slru-and-how-we-are-optimizing-it/
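The pg17 improvement Alvaro mentions exposes the SLRU buffer sizes as server
parameters. A hedged postgresql.conf sketch (parameter names as of pg17;
the values are illustrative only, not recommendations):

```
multixact_offset_buffers = 64    # replaces compile-time NUM_MULTIXACTOFFSET_BUFFERS
multixact_member_buffers = 128   # pg_multixact/members SLRU
subtransaction_buffers   = 64    # pg_subtrans SLRU, relevant for savepoints
```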

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"




Re: many backends hang on MultiXactOffsetSLRU

2024-09-10 Thread Alvaro Herrera
On 2024-Sep-10, Amine Tengilimoglu wrote:

> We identified the cause as savepoints being used excessively
> and without control. Once they reduced the number of savepoints, the issue
> was resolved. However, the documentation also mentions that it could be
> caused by foreign keys.

Yeah, it's exactly the same problem; when it comes from savepoints the
issue is pg_subtrans, and when foreign keys are involved, it is
pg_multixact.  Both of those use the SLRU subsystem, which was heavily
modified in pg17 as I mentioned in my reply to James.
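Not from the original reply: since pg13, per-SLRU cache activity can be
inspected directly via the pg_stat_slru view, which helps tell pg_subtrans
pressure apart from pg_multixact pressure. A sketch:

```sql
-- Hit ratio per SLRU cache; heavy blks_read for MultiXactOffset or
-- Subtrans points at the corresponding SLRU being thrashed.
SELECT name, blks_hit, blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4)
         AS hit_ratio
FROM pg_stat_slru
ORDER BY blks_read DESC;
```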

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I think my standards have lowered enough that now I think 'good design'
is when the page doesn't irritate the living f*ck out of me." (JWZ)




Re: many backends hang on MultiXactOffsetSLRU

2024-09-10 Thread James Pang
There are no foreign keys, but there is one session that did transactions
to tables with savepoints, one savepoint per SQL statement in the same
transaction. The sessions running the "SELECT " query do not use savepoints;
there are just a lot of sessions running the same query that suddenly hang
on MultiXact. Even with only one session doing DML with savepoints, all the
other query sessions can see this kind of "MultiXact" waiting, right?




Re: many backends hang on MultiXactOffsetSLRU

2024-09-10 Thread Amine Tengilimoglu
  I hadn't found a satisfactory explanation of the upper limit
related to SLRU, so this document will be useful. It's a nice development
that the relevant limit has been raised in pg17; I hope I don't run into a
situation where even that limit is exceeded on large systems.


Kind regards..



Re: many backends hang on MultiXactOffsetSLRU

2024-09-10 Thread Alvaro Herrera
On 2024-Sep-10, James Pang wrote:

> Even with only one session doing DML with savepoints, all the other
> query sessions can see this kind of "MultiXact" waiting, right?

I think SELECT FOR UPDATE combined with savepoints can create
multixacts even in the absence of foreign keys.

A query that's waiting doesn't need to have *created* the multixact or
subtrans -- it is sufficient that it's forced to look it up.
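A minimal sketch of that pattern (not from the original mail; table and
column names are hypothetical). The row lock taken by the top-level
transaction and the lock taken under the savepoint belong to two different
(sub)transaction IDs, so both lockers get bundled into a multixact on the
tuple:

```sql
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;   -- locked by the top-level xact
SAVEPOINT s1;
UPDATE accounts SET balance = balance - 1 WHERE id = 1;
-- The subtransaction now also holds a lock on the same row, so the two
-- lockers are recorded together as a multixact (one per tuple touched).
COMMIT;
```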

If thousands of sessions tried to look up different multixact values
(spread across more than 8 pages), then thrashing of the cache would
result, with catastrophic performance.  This can probably be caused by
some operation that creates one multixact per tuple in a few thousand
tuples.

Maybe you could ease this by doing VACUUM on the table (perhaps with a
low multixact freeze age), which might remove some of the multixacts.
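A hedged sketch of that cleanup (table name hypothetical): setting the
multixact freeze ages to zero for the session forces this VACUUM to replace
old multixact xmax values on the tuples it visits:

```sql
SET vacuum_multixact_freeze_min_age = 0;
SET vacuum_multixact_freeze_table_age = 0;
VACUUM (VERBOSE) accounts;
```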

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Para tener más hay que desear menos"