WAL files keep piling up
Hi,

I have a PostgreSQL (9.6.15) two-node cluster set up with Patroni. The instances are configured to work in a master <-> synchronous standby setup and both run in Docker containers with the pgdata volume mounted from the host. When the master is restarted, the synchronous standby automatically takes over the master role and the old master starts operating as the synchronous replica.

Everything works fine, but whenever I restart the master instance it creates a new WAL file in pg_xlog/; however, old WAL files are not cleaned up. They pile up, and in my reproduction environment, where there are no operations on the database, they currently occupy:

    du -sh pg_xlog/
    17G    pg_xlog/

and the number of files is more or less:

    ls pg_xlog/ | grep -v history | wc -l
    1024

I've searched through the mailing lists and articles like https://dataegret.com/2018/04/pg_wal-is-too-big-whats-going-on/ and the main problems mentioned in most places are:
1. failing archive command
2. abandoned (inactive) replication slot
3. issues with checkpoints
4. too big `wal_keep_segments` value

However, none of those problems seem to apply to my deployment:
1. I have `archive_mode` set to `off`
1a. I even tried enabling it and setting the `archive_command` to '/bin/true' just to confirm a suggestion found in one of the posts on the mailing list (that didn't improve anything)
2. the only replication slot in `pg_replication_slots` is the one related to the synchronous replica, and it is active (see the queries after this message)
3. I've enabled `log_checkpoints` but don't see any errors or warnings related to checkpoints, whether triggered automatically or manually via `CHECKPOINT;`
4. `wal_keep_segments` is set to 8 and `max_wal_size` is set to 1GB (pg_settings shows 64 because its unit there is 16MB segments)

Is there anything else I should check that could shed some light on this?

A few configuration options taken from `pg_settings`:

    archive_command = (disabled)
    archive_mode = off
    archive_timeout = 0
    check_function_bodies = on
    checkpoint_completion_target = 0.5
    checkpoint_flush_after = 32
    checkpoint_timeout = 300
    hot_standby = on
    hot_standby_feedback = off
    max_replication_slots = 10
    max_wal_senders = 10
    max_wal_size = 64
    min_wal_size = 5
    synchronous_commit = on
    synchronous_standby_names = patroni1
    wal_block_size = 8192
    wal_buffers = 512
    wal_compression = off
    wal_keep_segments = 8
    wal_level = replica
    wal_log_hints = on
    wal_receiver_status_interval = 10
    wal_receiver_timeout = 6
    wal_retrieve_retry_interval = 5000
    wal_segment_size = 2048
    wal_sender_timeout = 6
    wal_sync_method = fdatasync
    wal_writer_delay = 200
    wal_writer_flush_after = 128

Kind regards
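A sketch of the checks behind points 2 and 4, using the 9.6 function names (these were renamed in PostgreSQL 10, e.g. pg_xlogfile_name -> pg_walfile_name):

    # which WAL segment each replication slot still pins
    psql -c "SELECT slot_name, active, restart_lsn,
                    pg_xlogfile_name(restart_lsn) AS oldest_pinned_segment
             FROM pg_replication_slots;"

    # current write position, for comparison
    psql -c "SELECT pg_current_xlog_location() AS write_lsn,
                    pg_xlogfile_name(pg_current_xlog_location()) AS current_segment;"

If the slot's restart_lsn is close to the current write location, the slot is not what is holding ~1000 segments back.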
Re: WAL files keep piling up
A stupid question. How many .ready files are there?

Regards,
Ninad Shah
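The archiver's own statistics are another quick cross-check here; a sketch using pg_stat_archiver, which has been available since 9.4:

    psql -c "SELECT archived_count, last_archived_wal,
                    failed_count, last_failed_wal
             FROM pg_stat_archiver;"

With archiving genuinely off, failed_count should stay at zero and the last_* columns should be empty.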
Re: WAL files keep piling up
Thanks for responding. On the current master it looks like so:

    ls pg_xlog/archive_status/ | grep ready | wc -l
    0
    ls pg_xlog/archive_status/ | grep done | wc -l
    501

Kind regards.
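With no .ready files in play, it is worth comparing the pile against what the last checkpoint still needs; a sketch, assuming the standard 9.6 layout under $PGDATA:

    # the oldest segment the latest checkpoint still requires
    pg_controldata "$PGDATA" | grep "REDO WAL file"

    # the oldest segments actually present; anything sorting below the
    # REDO segment is in principle recyclable (slots and
    # wal_keep_segments aside)
    ls "$PGDATA/pg_xlog" | grep -Ev 'history|archive_status' | sort | head -5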
Re: WAL files keep piling up
Zbigniew Kostrzewa writes:
> Thanks for responding. On current master it looks like so:
> ls pg_xlog/archive_status/ | grep ready | wc -l
> 0
> ls pg_xlog/archive_status/ | grep done | wc -l
> 501

Hmm, if you've got archiving turned off, I wonder why you have any .done files at all. Perhaps they are leftover from a time when you did have archiving on, and for some reason they are confusing the non-archive-mode cleanup logic.

Anyway, you could certainly manually remove the .done files and the corresponding WAL segment files, and then see what happens.

BTW, I'm sure you realize that 9.6.15 is not exactly current.

			regards, tom lane
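A cautious sketch of that manual cleanup. It assumes the 9.6 layout under $PGDATA and that you have first confirmed, e.g. with pg_controldata, that these segments predate the latest checkpoint's REDO WAL file; removing a segment that is still needed will break the cluster:

    # back up pg_xlog/ at the filesystem level before running this
    cd "$PGDATA/pg_xlog"
    for f in archive_status/*.done; do
        seg=$(basename "$f" .done)          # e.g. 000000010000000400000027
        case "$seg" in
            *.history) continue ;;          # keep timeline history files
        esac
        [ -f "$seg" ] && rm -v "$seg"       # the WAL segment itself
        rm -v "$f"                          # its archive-status marker
    done

Alternatively, the bundled pg_archivecleanup tool can be pointed at pg_xlog/ with the REDO WAL file reported by pg_controldata as the cut-off, removing everything logically older in one step.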
Re: WAL files keep piling up
I've just checked my second reproduction cluster (also Patroni, but this time on K8s). It also has a non-empty `archive_status/` directory:

    ls pg_xlog/ | grep -v history | wc -l
    165
    ls pg_xlog/archive_status/ | wc -l
    81
    ls pg_xlog/archive_status/ | grep done | wc -l
    81

but on this cluster I did not enable `archive_mode` at any time:

    postgres=# select name,setting from pg_settings where name like 'archive_%';
          name       |  setting
    -----------------+------------
     archive_command | (disabled)
     archive_mode    | off
     archive_timeout | 0

Yes, I am aware 9.6 is pretty old; soon I will be replacing it with 11.x.

Thanks.
Kind regards.
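A quick way to verify that any cleanup sticks on either cluster (a sketch): remove the stale files as suggested, then force a checkpoint and recount:

    psql -c "CHECKPOINT;"
    ls pg_xlog/ | grep -v history | wc -l
    ls pg_xlog/archive_status/ | wc -l

If the counts stay flat across subsequent restarts and failovers, that would suggest the leftover .done files were indeed what was confusing the recycling logic.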
