PostgreSQL 13.7 hangs

2022-07-04 Thread Bogdan Siara
Hi all,
I have a PostgreSQL cluster (13.7) with two nodes (master, slave), compiled in
Docker (Alpine 3.15) with these flags:

./configure --prefix=${PG_DIR} --exec-prefix=${PG_DIR}
--enable-integer-datetimes --enable-thread-safety --disable-rpath
--with-uuid=e2fs --with-gnu-ld --with-pgport=5432
--with-system-tzdata=/usr/share/zoneinfo --with-llvm --with-gssapi
--with-ldap --with-icu --with-tcl --with-perl --with-python --with-pam
--with-openssl --with-libxml --with-libxslt

and running with the following configuration:

listen_addresses = '*'
port = 5432
max_connections = 1000
unix_socket_directories = '/opt/pg/data'
superuser_reserved_connections = 3
shared_buffers = 6GB
temp_buffers = 32MB
max_prepared_transactions = 100
work_mem = 1146kB
maintenance_work_mem = 1792MB
max_stack_depth = 4MB
dynamic_shared_memory_type = posix
archive_command = '/opt/pg/bin/pgbackrest --stanza=aws-prdan archive-push
%p'
archive_mode = on
max_wal_senders = 10
min_wal_size = 2GB
max_wal_size = 3GB
wal_level = logical
checkpoint_completion_target = 0.9
effective_cache_size = 18GB
log_destination = 'stderr'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_checkpoints = on
log_line_prefix = '%m [%p] - [%a - %u - %d] [%h] : %e'
log_min_duration_statement = 0
stats_temp_directory = '/opt/pg/pg_stat_tmp'
autovacuum = on
autovacuum_max_workers = 3
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50
datestyle = 'iso, dmy'
default_text_search_config = 'pg_catalog.simple'
jit = on
jit_above_cost = 10
jit_debugging_support = off
jit_dump_bitcode = off
jit_expressions = on
jit_inline_above_cost = 50
jit_optimize_above_cost = 50
jit_profiling_support = off
jit_provider = llvmjit
jit_tuple_deforming = on
max_worker_processes = 8
max_parallel_workers_per_gather = 2
max_parallel_workers = 2
default_statistics_target = 100
synchronous_commit = off
random_page_cost = 1.1
effective_io_concurrency = 200
track_activity_query_size = 1
pg_stat_statements.track = all

Sometimes my cluster nodes hang and stop responding. The logs end without
any error:

2022-07-02 00:42:51.755 P00   INFO: archive-push command begin 2.39:
[pg_wal/00015596002C] --archive-async --compress-type=lz4
--exec-id=787394-196ba324 --log-level-console=info
--log-level-file=detail --pg1-path=/opt/pg/data --process-max=4
--repo1-cipher-pass= --repo1-cipher-type=aes-256-cbc
--repo1-path=/repo-path --repo1-s3-bucket=backup-postgresql
 --repo1-s3-endpoint=s3.eu-central-1.amazonaws.com
--repo1-s3-key= --repo1-s3-key-secret=
--repo1-s3-region=eu-central-1 --repo1-type=s3 --stanza=aws-prdan
2022-07-02 00:42:51.755 P00   INFO: pushed WAL file
'00015596002C' to the archive asynchronously
2022-07-02 00:42:51.755 P00   INFO: archive-push command end: completed
successfully (1ms)
2022-07-02 07:54:34.333 GMT [15] - [ -  - ] [] : 0LOG:  starting
PostgreSQL 13.7 on x86_64-pc-linux-musl, compiled by gcc (Alpine
10.3.1_git20211027) 10.3.1 20211027, 64-bit
2022-07-02 07:54:34.335 GMT [15] - [ -  - ] [] : 0LOG:  listening on
IPv4 address "0.0.0.0", port 5432

I can log in to the database using psql but can't execute any command, and I
can't stop the database with kill -INT, kill -QUIT, or docker stop. The database
on the ZFS volume is already 3.8 TB. In the end I restart the entire host and
everything returns to normal. The situation occurs on both master and slave nodes.

Please tell me what I should do to trace and fix the problem.
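
For anyone debugging the same symptom, a few checks along these lines can show
whether the backends are stuck in the kernel (uninterruptible I/O wait) rather
than inside PostgreSQL. This is only a sketch: it assumes shell access to the
host running the container and a psql session that still responds.

# Show what the non-idle backends are currently waiting on.
psql -c "SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
         FROM pg_stat_activity
         WHERE state <> 'idle';"

# On the host: 'D' in the STAT column is uninterruptible sleep (usually stuck
# I/O), which would also explain why kill -INT / kill -QUIT have no effect.
ps -eo pid,stat,wchan:32,cmd | grep '[p]ostgres'

# Kernel stack of one stuck backend (needs root); replace <PID> with a real pid.
cat /proc/<PID>/stack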

Thanks for your attention.
Regards
BS


PostgreSQL blocked for more than 120 s

2023-07-16 Thread Bogdan Siara
Hi, I have a problem with my slave PostgreSQL instance. During normal
operations the database is locked and in syslog I see this stack trace:

Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.289110] INFO: task
postgres:1172 blocked for more than 120 seconds.
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.292137]   Not
tainted 5.19.0-1028-aws #29~22.04.1-Ubuntu
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.295239] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299360] task:postgres
 state:D stack:0 pid: 1172 ppid:  1133 flags:0x0004
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299366] Call Trace:
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299369]  
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299373]
 __schedule+0x254/0x5a0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299381]
 schedule+0x5d/0x100
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299385]
 io_schedule+0x46/0x80
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299389]
 blk_mq_get_tag+0x117/0x300
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299394]  ?
destroy_sched_domains_rcu+0x40/0x40
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299399]
 __blk_mq_alloc_requests+0xc4/0x1e0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299404]
 blk_mq_get_new_requests+0xcc/0x190
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299408]
 blk_mq_submit_bio+0x1eb/0x450
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299411]
 __submit_bio+0xf6/0x190
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299414]  ?
kmem_cache_alloc+0x1a6/0x2f0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299419]
 __submit_bio_noacct+0x81/0x1f0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299422]
 submit_bio_noacct_nocheck+0x104/0x1c0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299425]  ?
ext4_inode_block_valid+0x1d/0x40
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299430]
 submit_bio_noacct+0x1b9/0x600
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299433]
 submit_bio+0x40/0xf0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299437]
 ext4_mpage_readpages+0x254/0xc20
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299441]  ?
xas_load+0x1f/0x100
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299447]  ?
__filemap_add_folio+0x1ca/0x540
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299452]
 ext4_readahead+0x3f/0x50
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299455]
 read_pages+0x7b/0x2e0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299459]
 page_cache_ra_unbounded+0x12d/0x180
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299462]
 force_page_cache_ra+0xc5/0x100
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299466]
 generic_fadvise+0x19d/0x280
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299469]
 ksys_fadvise64_64+0x9f/0xb0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299471]  ?
syscall_trace_enter.constprop.0+0xb5/0x1a0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299475]
 __x64_sys_fadvise64+0x1c/0x30
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299478]
 do_syscall_64+0x5c/0x90
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299482]  ?
exc_page_fault+0x92/0x190
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299485]
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299489] RIP:
0033:0x7f7358c993c2
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299493] RSP:
002b:7ffd448a8a88 EFLAGS: 0206 ORIG_RAX: 00dd
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299498] RAX:
ffda RBX: 0383 RCX: 7f7358c993c2
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299501] RDX:
2000 RSI: 099a RDI: 0389
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299503] RBP:
2000 R08: 0002 R09: 
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299506] R10:
0003 R11: 0206 R12: 099a
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299508] R13:
0a10 R14: 01504cd0 R15: 7ffd448a8ad0
Jul 16 22:34:12 ec1-prd-postgresan-2 kernel: [ 1331.299514]  
Jul 16 22:36:13 ec1-prd-postgresan-2 kernel: [ 1452.119882] INFO: task
postgres:1172 blocked for more than 241 seconds.
Jul 16 22:36:13 ec1-prd-postgresan-2 kernel: [ 1452.122832]   Not
tainted 5.19.0-1028-aws #29~22.04.1-Ubuntu
Jul 16 22:36:13 ec1-prd-postgresan-2 kernel: [ 1452.128443] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 16 22:36:13 ec1-prd-postgresan-2 kernel: [ 1452.132098] task:postgres
 state:D stack:0 pid: 1172 ppid:  1133 flags:0x0004
Jul 16 22:36:13 ec1-prd-postgresan-2 kernel: [ 1452.132106] Call Trace:
Jul 16 22:36:13 ec1-prd-postgresan-2 kernel: [ 1452.132109]  
Jul 16 22:36:13 ec1-prd-postgresan-2 kernel: 

Re: PostgreSQL blocked for more than 120 s

2023-07-20 Thread Bogdan Siara
Hi,
The problem was disk I/O starvation; migrating the storage to higher IOPS
resolved it.
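
For anyone who finds this thread with a similar trace: the blocked path above
(fadvise64 -> io_schedule -> blk_mq_get_tag) is a backend sleeping until the
block layer has a free request slot, which fits a saturated device. A check
along these lines makes the starvation visible while it happens (a sketch;
assumes the sysstat package, and the second command needs a kernel with
pressure-stall information enabled):

# Extended per-device statistics once per second; sustained %util near 100 with
# a growing aqu-sz and high await means the device itself is the bottleneck.
iostat -x 1 10

# Pressure-stall information: the "full" line is the share of time during which
# all runnable tasks were blocked waiting on I/O.
cat /proc/pressure/io
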
Regards
BS

Mon, 17 Jul 2023 at 07:40 Bogdan Siara wrote:

> Hi, I have a problem with my slave PostgreSQL instance. During normal
> operations the database is locked and in syslog I see this stack trace:
> [kernel hung-task stack trace snipped; quoted in full above]

huge_pages=on causes "could not map anonymous shared memory: Cannot allocate memory"

2025-05-07 Thread Bogdan Siara
Hi,
I have a problem running PostgreSQL on an Ubuntu 24.04 server with huge_pages =
on. My instance has 8 GB of RAM and 2 vCPUs (t3a.large). My configuration is:

max_connections = 1000
superuser_reserved_connections = 3
shared_buffers = 1960MB
effective_cache_size = 5881MB
huge_pages = on
temp_buffers = 32MB
max_prepared_transactions = 100
work_mem = 1MB
maintenance_work_mem = 392MB
max_stack_depth = 4MB
dynamic_shared_memory_type = posix
archive_mode = on
max_wal_senders = 10
wal_sender_timeout = 0
min_wal_size = 4GB
max_wal_size = 16GB
wal_level = logical
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9

In my sysctl.conf I have:

vm.overcommit_memory=2
vm.overcommit_ratio=50
vm.vfs_cache_pressure=50
vm.dirty_background_ratio=10
vm.dirty_ratio=40
fs.nr_open=1000
fs.file-max=1000
vm.nr_hugepages=980
vm.hugetlb_shm_group=1010

My huge pages section of /proc/meminfo looks like this:

cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:     980
HugePages_Free:      980
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         2007040 kB

When I start the postgres instance I get this error:
May 07 10:56:32 pg1 pg_ctl[16753]: server starting
May 07 10:56:32 pg1 systemd[1]: Started postgres-17.4-local.service -
PostgreSQL 17.4 database local server.
May 07 10:56:35 pg1 pg_ctl[16756]: 2025-05-07 10:56:35.826 CEST [16756] - [
-  - ] [] : XX000FATAL:  could not map anonymous shared memory: Cannot
allocate memory
May 07 10:56:35 pg1 pg_ctl[16756]: 2025-05-07 10:56:35.826 CEST [16756] - [
-  - ] [] : XX000HINT:  This error usually means that PostgreSQL's request
for a shared memory segment exceeded available memory, swap space, or huge
pages. To reduce the request size (currently 2204106752 bytes), reduce
PostgreSQL's shared memory usage, perhaps by reducing "shared_buffers"
or "max_connections".
May 07 10:56:35 pg1 pg_ctl[16756]: 2025-05-07 10:56:35.826 CEST [16756] - [
-  - ] [] : 0LOG:  database system is shut down
May 07 10:56:35 pg1 systemd[1]: postgres-17.4-sdx.service: Main process
exited, code=exited, status=1/FAILURE
May 07 10:56:35 pg1 systemd[1]: postgres-17.4-sdx.service: Failed with
result 'exit-code'.
May 07 10:56:35 pg1 systemd[1]: postgres-17.4-sdx.service: Consumed 3.493s
CPU time, 2.8M memory peak, 0B memory swap peak.

When I comment out huge_pages=on in the configuration, PostgreSQL starts without
problems.
Can someone tell me where the problem is?
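
A useful check here is to ask the server binary how many huge pages the current
configuration actually needs and compare it with what the kernel has reserved.
This is only a sketch: shared_memory_size_in_huge_pages exists from PostgreSQL 15
on, the -C query for runtime-computed parameters only works while the server is
stopped, and $PGDATA standing for the data directory is an assumption.

# Number of 2 MB huge pages this postgresql.conf would request at startup.
postgres -D "$PGDATA" -C shared_memory_size_in_huge_pages

# What the kernel currently has set aside.
grep -E 'HugePages_(Total|Free)' /proc/meminfo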

Regards
Bogdan


Re: huge_pages=on causes "could not map anonymous shared memory: Cannot allocate memory"

2025-05-07 Thread Bogdan Siara
Hi Alicja,
Thanks for your advice; PostgreSQL now works fine with huge_pages=on.
Regards
Bogdan

Wed, 7 May 2025 at 14:17 Alicja Kucharczyk wrote:

> Hi Bogdan,
> The root cause here is that the number of huge pages you've configured
> (vm.nr_hugepages = 980) is not sufficient.
> Each huge page on your system is 2 MB in size, so 980 pages give you
> roughly 1.96 GB of memory (980 × 2 MB). However, PostgreSQL is clearly
> requesting about 2.2 GB of shared memory (specifically, 2204106752 bytes as
> shown in the error message you provided), which exceeds what's available
> through huge pages.
>
> That’s why PostgreSQL fails to start when huge_pages = on - it requires
> the entire shared memory segment to come from huge pages and refuses to
> fall back to regular ones.
>
> Earlier, you had the huge_pages setting commented out, which means
> PostgreSQL used the default value: huge_pages = try. In that mode, it first
> attempts to use huge pages, but if that fails (like in your case due to
> insufficient allocation), it falls back to standard memory pages — which is
> why the instance started without issues then.
>
> To fix the issue, you should increase vm.nr_hugepages to at least 1100 to
> fully cover the shared memory request (you can go a bit higher to be safe
> and then reduce it as described in the article I'm pasting the link to).
>
> Also, a side note: max_connections = 1000 is quite high for an instance
> with 8 GB of RAM and only 2 vCPUs. Even if huge pages are properly
> allocated, such a high number of connections can lead to performance
> issues. You might want to consider lowering it or using a connection pooler
> like PgBouncer.
>
> If you’d like to understand how huge pages work in PostgreSQL, including
> how to calculate memory needs and configure the OS properly, I wrote a
> detailed article some time ago (still valid). It’s in Polish, which I
> assume is fine for you:
> https://linuxpolska.com/pl/baza-wiedzy/blog/postgres-pamieci-ram-tipstricks/
>
> best regards,
> Alicja Kucharczyk
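
For reference, the arithmetic from the error message: 2204106752 bytes at 2 MiB
per page is exactly 1051 huge pages, so the configured 980 falls 71 pages short.
A minimal way to apply the fix described above (a sketch; the value 1100 follows
the suggested headroom, and the sysctl.d filename is an assumption):

# Reserve more huge pages now; this can fail or require a reboot if memory is
# too fragmented for the kernel to find enough contiguous 2 MB blocks.
sudo sysctl -w vm.nr_hugepages=1100

# Make the setting persistent across reboots.
echo 'vm.nr_hugepages=1100' | sudo tee /etc/sysctl.d/99-postgres-hugepages.conf

# Verify before restarting PostgreSQL with huge_pages = on.
grep -E 'HugePages_(Total|Free)' /proc/meminfo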