Postgresql 16.3 Out Of Memory

2024-06-03 Thread Radu Radutiu
Hello,

I have an out of memory problem after upgrading from postgresql 12 to 16.3.
I have identified one query that can reproduce the error on demand. Once
the query starts, it will eventually exhaust all RAM and swap until the OOM
killer will stop postgresql.
The setup is as follows:
- One VMWARE VM, RHEL 8.9, 64 GB RAM, 16 vCPU, 32 GB swap. Database is used
by a local java application that takes up to 22 GB (hard limit). The
application uses 2 db connections (long running, app uses a connection
pool). The database has about 10 tables, with 3 large tables involved in
the problem query:
 relation   | total_size
---+
 outputrequest | 217 GB
 inputrequest  | 149 GB
 tran  | 5041 MB

I have the following settings:
shared_buffers = '8192MB'
effective_cache_size = '24GB'
maintenance_work_mem = '2GB'
checkpoint_completion_target = '0.9'
wal_buffers = '16MB'
default_statistics_target = '100'
random_page_cost = '1.1'
effective_io_concurrency = '200'
work_mem = '104857kB'
min_wal_size = '1GB'
max_wal_size = '4GB'
max_worker_processes = '16'
max_parallel_workers_per_gather = '4'
max_parallel_workers = '16'
max_parallel_maintenance_workers = '4'
jit = 'off'

Beside OOM there is another (possibly related)problem: the postgresql
memory usage is significantly higher than the shared_buffers + the rest of
the memory used per connection. On my system with shared buffers 8G and
work_mem ~ 100M I have memory usage of 20GB with the system running just
simple inserts (the memory might have increased due to previous queries).

I have found out that if I set  enable_parallel_hash = 'off', I do get high
memory usage for the problem query (above 30G) but not OOM. With the
enable_parallel_hash = 'on', I will always get OOM. For simplicity I've
restricted postgresql systemd service to cap the memory at 36 GB (so that I
don't have to wait for the 32 GB swap to be filled before OOM).  This is
the memory usage I get when running the query, before systemd restarts
postgres - memory usage is 35.9 GB just below the 36 GB limit (the query is
running with explain analyze,buffers, the other 2 idle connections are from
the application):

[postgres@ips3 ~]$ systemctl status postgresql-16; top -u postgres -n 1 -c
● postgresql-16.service - PostgreSQL 16 database server
   Loaded: loaded (/etc/systemd/system/postgresql-16.service; enabled;
vendor preset: disabled)
  Drop-In: /etc/systemd/system/postgresql-16.service.d
   └─override.conf
   Active: active (running) since Mon 2024-06-03 04:23:16 +08; 11h ago
 Docs: https://www.postgresql.org/docs/16/static/
  Process: 283703 ExecStartPre=/usr/pgsql-16/bin/postgresql-16-check-db-dir
${PGDATA} (code=exited, status=0/SUCCESS)
 Main PID: 283709 (postgres)
Tasks: 14 (limit: 408404)
   Memory: 35.9G (limit: 36.0G)
   CGroup: /system.slice/postgresql-16.service
   ├─283709 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
   ├─283710 postgres: logger
   ├─295499 postgres: checkpointer
   ├─295500 postgres: background writer
   ├─295506 postgres: walwriter
   ├─295507 postgres: autovacuum launcher
   ├─295508 postgres: logical replication launcher
   ├─295511 postgres: db1 db1 127.0.0.1(37720) idle
   ├─295946 postgres: postgres db1 [local] EXPLAIN
   ├─295947 postgres: parallel worker for PID 295946
   ├─295948 postgres: parallel worker for PID 295946
   ├─295949 postgres: parallel worker for PID 295946
   ├─295950 postgres: parallel worker for PID 295946
   └─296037 postgres: db1 db1 127.0.0.1(41708) idle

top - 16:14:32 up 14 days,  4:42,  1 user,  load average: 5.41, 3.58, 2.25
Tasks: 405 total,   6 running, 399 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us,  5.2 sy,  0.0 ni, 71.2 id, 21.3 wa,  0.7 hi,  0.4 si,
 0.0 st
MiB Mem :  63873.4 total,   2162.0 free,  57177.8 used,   4533.6 buff/cache
MiB Swap:  32260.0 total,   3813.8 free,  28446.2 used.   4349.0 avail Mem

PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+
COMMAND


 295947 postgres  20   0   21.2g   7.4g 654168 R  20.0  11.8   1:31.81
postgres: parallel worker for PID 295946


 295949 postgres  20   0   21.3g   7.5g 656316 R  20.0  12.0   1:32.57
postgres: parallel worker for PID 295946


 295950 postgres  20   0   21.2g   7.5g 654756 R  20.0  12.1   1:31.99
postgres: parallel worker for PID 295946


 295946 postgres  20   0   21.2g   7.4g 655684 R  13.3  11.8   1:31.86
postgres: postgres db1 [local] EXPLAIN


 295948 postgres  20   0   21.6g   6.0g 656024 R  13.3   9.6   1:37.91
postgres: parallel worker for PID 295946


 283709 postgres  20   0 8960628  44144  43684 S   0.0   0.1   0:42.28
/usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/


 283710 postgres  20   0  355288   4876   4576 S   0.0   0.0   0:04.42
postgres: logger

 295499 postgres  20   0 8960936 107376 106868 S   0.0   0.2   0:01.75
postgres: checkpointer


 295500 postg

Re: Postgresql 16.3 Out Of Memory

2024-06-03 Thread Greg Sabino Mullane
On Mon, Jun 3, 2024 at 6:19 AM Radu Radutiu  wrote:

> Do you have any idea how to further debug the problem?
>

Putting aside the issue of non-reclaimed memory for now, can you show us
the actual query? The explain analyze you provided shows it doing an awful
lot of joins and then returning 14+ million rows to the client. Surely the
client does not need that many rows?

You can probably also prevent OOM by lowering your work_mem and/or your
max_parallel_workers_per_gather.

Cheers,
Greg


Re: Postgresql 16.3 Out Of Memory

2024-06-03 Thread Ron Johnson
On Mon, Jun 3, 2024 at 9:12 AM Greg Sabino Mullane 
wrote:

> On Mon, Jun 3, 2024 at 6:19 AM Radu Radutiu  wrote:
>
>> Do you have any idea how to further debug the problem?
>>
>
> Putting aside the issue of non-reclaimed memory for now, can you show us
> the actual query? The explain analyze you provided shows it doing an awful
> lot of joins and then returning 14+ million rows to the client. Surely the
> client does not need that many rows?
>

And the query cost is really high.  "Did you ANALYZE the instance after
conversion?" was my first question.


Unexpected results from CALL and AUTOCOMMIT=off

2024-06-03 Thread Victor Yegorov
Greetings.

I am observing the following results on PostgreSQL 15.7
First, setup:

create table t_test(x bigint);
insert into t_test values(0);

create or replace function f_get_x()
returns bigint
language plpgsql
stable
as $function$
declare
l_result bigint;
begin
select x into l_result from t_test;
--raise notice 'f_get_x() >> x=%', l_result;
--raise notice 'f_get_x() >> xact=%', txid_current_if_assigned();
return l_result;
end;
$function$;

create or replace procedure f_print_x(x bigint)
language plpgsql
as $procedure$
begin
raise notice 'f_print_x() >> x=%', x;
--raise notice 'f_print_x() >> xact=%', txid_current_if_assigned();
end;
$procedure$;


Now, the case:
\set AUTOCOMMIT off
do
$$ begin
--raise notice 'do >> xact=%', txid_current_if_assigned();
update t_test set x = 1;
--raise notice 'do >> xact=%', txid_current_if_assigned();
raise notice 'do >> x=%', f_get_x();
--raise notice 'do >> xact=%', txid_current_if_assigned();
call f_print_x(f_get_x());
end; $$;
NOTICE:  do >> x=1
NOTICE:  f_print_x() >> x=0
DO

I don't understand why CALL statement is not seeing an updated record.
With AUTOCOMMIT=on, all goes as expected.

I tried to examine snapshots and xids (commented lines), but they're always
the same.

Can you explain this behavior, please? Is it expected?

-- 
Victor Yegorov


Re: Unexpected results from CALL and AUTOCOMMIT=off

2024-06-03 Thread Pierre Forstmann
You declared function f_get_x as stable which means:

https://www.postgresql.org/docs/15/sql-createfunction.html

STABLE indicates that the function cannot modify the database, and that
within a single table scan it will consistently return the same result for
the same argument values, but that its result could change across SQL
statements. This is the appropriate selection for functions whose results
depend on database lookups, parameter variables (such as the current time
zone), etc. (It is inappropriate for AFTER triggers that wish to query rows
modified by the current command.) Also note that the current_timestamp
family of functions qualify as stable, since their values do not change
within a transaction.

If you remove stable from function declaration, it works as expected:

drop table t_test;
DROP TABLE
create table t_test(x bigint);
CREATE TABLE
insert into t_test values(0);
INSERT 0 1
create or replace function f_get_x()
returns bigint
language plpgsql
-- stable
as $function$
declare
l_result bigint;
begin
select x into l_result from t_test;
--raise notice 'f_get_x() >> x=%', l_result;
--raise notice 'f_get_x() >> xact=%', txid_current_if_assigned();
return l_result;
end;
$function$;
CREATE FUNCTION
create or replace procedure f_print_x(x bigint)
language plpgsql
as $procedure$
begin
raise notice 'f_print_x() >> x=%', x;
--raise notice 'f_print_x() >> xact=%', txid_current_if_assigned();
end;
$procedure$;
CREATE PROCEDURE
do
$$ begin
--raise notice 'do >> xact=%', txid_current_if_assigned();
update t_test set x = 1;
--raise notice 'do >> xact=%', txid_current_if_assigned();
raise notice 'do >> x=%', f_get_x();
--raise notice 'do >> xact=%', txid_current_if_assigned();
call f_print_x(f_get_x());
end; $$;
psql:test.sql:38: NOTICE:  do >> x=1
psql:test.sql:38: NOTICE:  f_print_x() >> x=1
DO

Le lun. 3 juin 2024 à 16:42, Victor Yegorov  a écrit :

> Greetings.
>
> I am observing the following results on PostgreSQL 15.7
> First, setup:
>
> create table t_test(x bigint);
> insert into t_test values(0);
>
> create or replace function f_get_x()
> returns bigint
> language plpgsql
> stable
> as $function$
> declare
> l_result bigint;
> begin
> select x into l_result from t_test;
> --raise notice 'f_get_x() >> x=%', l_result;
> --raise notice 'f_get_x() >> xact=%', txid_current_if_assigned();
> return l_result;
> end;
> $function$;
>
> create or replace procedure f_print_x(x bigint)
> language plpgsql
> as $procedure$
> begin
> raise notice 'f_print_x() >> x=%', x;
> --raise notice 'f_print_x() >> xact=%', txid_current_if_assigned();
> end;
> $procedure$;
>
>
> Now, the case:
> \set AUTOCOMMIT off
> do
> $$ begin
> --raise notice 'do >> xact=%', txid_current_if_assigned();
> update t_test set x = 1;
> --raise notice 'do >> xact=%', txid_current_if_assigned();
> raise notice 'do >> x=%', f_get_x();
> --raise notice 'do >> xact=%', txid_current_if_assigned();
> call f_print_x(f_get_x());
> end; $$;
> NOTICE:  do >> x=1
> NOTICE:  f_print_x() >> x=0
> DO
>
> I don't understand why CALL statement is not seeing an updated record.
> With AUTOCOMMIT=on, all goes as expected.
>
> I tried to examine snapshots and xids (commented lines), but they're
> always the same.
>
> Can you explain this behavior, please? Is it expected?
>
> --
> Victor Yegorov
>


Re: Unexpected results from CALL and AUTOCOMMIT=off

2024-06-03 Thread Victor Yegorov
пн, 3 июн. 2024 г. в 20:40, Pierre Forstmann :

> You declared function f_get_x as stable which means:
>
> …
>
> If you remove stable from function declaration, it works as expected:
>

Well, I checked
https://www.postgresql.org/docs/current/xfunc-volatility.html
There's a paragraph describing why STABLE (and IMMUTABLE) use different
snapshots:

> For functions written in SQL or in any of the standard procedural
languages, there is a second important property determined by the
volatility category, namely the visibility of any data changes that have
been made by the SQL command that is calling the function. A > VOLATILE
function will see such changes, a STABLE or IMMUTABLE function will not.
This behavior is implemented using the snapshotting behavior of MVCC (see
Chapter 13): STABLE and IMMUTABLE functions use a snapshot established as
of the start of the
> calling query, whereas VOLATILE functions obtain a fresh snapshot at the
start of each query they execute.

But later, docs state, that

> Because of this snapshotting behavior, a function containing only SELECT
commands can safely be marked STABLE, even if it selects from tables that
might be undergoing modifications by concurrent queries. PostgreSQL will
execute all commands of a STABLE function using the snapshot established
for the calling query, and so it will see a fixed view of the database
throughout that query.

And therefore I assume STABLE should work in this case. Well, it seems not
to.

I assume there's smth to do with implicit BEGIN issued in non-AUTOCOMMIT
mode and non-atomic DO block behaviour.


-- 
Victor Yegorov


Re: Unexpected results from CALL and AUTOCOMMIT=off

2024-06-03 Thread Tom Lane
Victor Yegorov  writes:
> пн, 3 июн. 2024 г. в 20:40, Pierre Forstmann :
>> If you remove stable from function declaration, it works as expected:

> ... therefore I assume STABLE should work in this case. Well, it seems not
> to.

I agree that this looks like a bug, since your example shows that the
same function works as-expected in an ordinary expression but not in
a CALL.  The dependency on AUTOCOMMIT (that is, being within an outer
transaction block) seems even odder.  I've not dug into it yet, but
I suppose we're passing the wrong snapshot to the CALL arguments.
A volatile function wouldn't use that snapshot, explaining Pierre's
result.

regards, tom lane