WAL files keep piling up

2021-12-22 Thread Zbigniew Kostrzewa
Hi,

I have a PostgreSQL (9.6.15) two node cluster setup with Patroni. The
instances are configured to work in master <-> synchronous standby setup
and both run in docker containers with pgdata volume mounted from host.
When master is restarted the synchronous standby automatically takes the
role of master and master starts operating as synchronous replica.

Everything works fine but whenever I restart master instance it creates a
new wal file in pg_xlog/ however old wal files are not cleaned up. They
pile up and in my reproduction environment, where there are no operations
on the database, they currently occupy:
du -sh pg_xlog/
17G pg_xlog/

and the number of the files is more or less:
ls pg_xlog/ | grep -v history | wc -l
1024

I've searched through the mailing lists and articles like
https://dataegret.com/2018/04/pg_wal-is-too-big-whats-going-on/ and the
main problems mentioned in most of the places are:
1. failing archive command
2. abandoned (inactive) replication slot
3. issues with checkpoints
3. too big wal_keep_segments value

However, none of those problems seem to apply to my deployment:
1. I have `archive_mode` set to `off`
1a. I even tried enabling it and setting the `archive_command` to
'/bin/true' just to confirm a suggestion found in one of the post on the
mailing list (that didn't improve anything)
2. the only replication slot in `pg_replication_slots` is the one related
to the synchronous replica and it is active
3. I've enabled `log_checkpoint` but doesn't see any errors or warnings
related to checkpoints triggered either automatically or manually via
`CHECKPOINT;`
4. `wal_keep_segments` is set to 8 and `max_wal_size` is set to 1GB

Is there anything that I should check that could shed some light on this?

A few configuration options taken from `pg_settings`:
archive_command = (disabled)
archive_mode = off
archive_timeout = 0
check_function_bodies = on
checkpoint_completion_target = 0.5
checkpoint_flush_after = 32
checkpoint_timeout = 300
hot_standby = on
hot_standby_feedback = off
max_replication_slots = 10
max_wal_senders = 10
max_wal_size = 64
min_wal_size = 5
synchronous_commit = on
synchronous_standby_names = patroni1
wal_block_size = 8192
wal_buffers = 512
wal_compression = off
wal_keep_segments = 8
wal_level = replica
wal_log_hints = on
wal_receiver_status_interval = 10
wal_receiver_timeout = 6
wal_retrieve_retry_interval = 5000
wal_segment_size = 2048
wal_sender_timeout = 6
wal_sync_method = fdatasync
wal_writer_delay = 200
wal_writer_flush_after = 128

Kind regards


Re: WAL files keep piling up

2021-12-22 Thread Ninad Shah
A stupid question. How many .ready files are there?


Regards,
Ninad Shah

On Wed, 22 Dec 2021 at 21:04, Zbigniew Kostrzewa 
wrote:

> Hi,
>
> I have a PostgreSQL (9.6.15) two node cluster setup with Patroni. The
> instances are configured to work in master <-> synchronous standby setup
> and both run in docker containers with pgdata volume mounted from host.
> When master is restarted the synchronous standby automatically takes the
> role of master and master starts operating as synchronous replica.
>
> Everything works fine but whenever I restart master instance it creates a
> new wal file in pg_xlog/ however old wal files are not cleaned up. They
> pile up and in my reproduction environment, where there are no operations
> on the database, they currently occupy:
> du -sh pg_xlog/
> 17G pg_xlog/
>
> and the number of the files is more or less:
> ls pg_xlog/ | grep -v history | wc -l
> 1024
>
> I've searched through the mailing lists and articles like
> https://dataegret.com/2018/04/pg_wal-is-too-big-whats-going-on/ and the
> main problems mentioned in most of the places are:
> 1. failing archive command
> 2. abandoned (inactive) replication slot
> 3. issues with checkpoints
> 3. too big wal_keep_segments value
>
> However, none of those problems seem to apply to my deployment:
> 1. I have `archive_mode` set to `off`
> 1a. I even tried enabling it and setting the `archive_command` to
> '/bin/true' just to confirm a suggestion found in one of the post on the
> mailing list (that didn't improve anything)
> 2. the only replication slot in `pg_replication_slots` is the one related
> to the synchronous replica and it is active
> 3. I've enabled `log_checkpoint` but doesn't see any errors or warnings
> related to checkpoints triggered either automatically or manually via
> `CHECKPOINT;`
> 4. `wal_keep_segments` is set to 8 and `max_wal_size` is set to 1GB
>
> Is there anything that I should check that could shed some light on this?
>
> A few configuration options taken from `pg_settings`:
> archive_command = (disabled)
> archive_mode = off
> archive_timeout = 0
> check_function_bodies = on
> checkpoint_completion_target = 0.5
> checkpoint_flush_after = 32
> checkpoint_timeout = 300
> hot_standby = on
> hot_standby_feedback = off
> max_replication_slots = 10
> max_wal_senders = 10
> max_wal_size = 64
> min_wal_size = 5
> synchronous_commit = on
> synchronous_standby_names = patroni1
> wal_block_size = 8192
> wal_buffers = 512
> wal_compression = off
> wal_keep_segments = 8
> wal_level = replica
> wal_log_hints = on
> wal_receiver_status_interval = 10
> wal_receiver_timeout = 6
> wal_retrieve_retry_interval = 5000
> wal_segment_size = 2048
> wal_sender_timeout = 6
> wal_sync_method = fdatasync
> wal_writer_delay = 200
> wal_writer_flush_after = 128
>
> Kind regards
>


Re: WAL files keep piling up

2021-12-22 Thread Zbigniew Kostrzewa
Thanks for responding. On current master it looks like so:

ls pg_xlog/archive_status/ | grep ready | wc -l
0

ls pg_xlog/archive_status/ | grep done | wc -l
501

Kind regards.


śr., 22 gru 2021 o 19:01 Ninad Shah  napisał(a):

> A stupid question. How many .ready files are there?
>
>
> Regards,
> Ninad Shah
>
> On Wed, 22 Dec 2021 at 21:04, Zbigniew Kostrzewa 
> wrote:
>
>> Hi,
>>
>> I have a PostgreSQL (9.6.15) two node cluster setup with Patroni. The
>> instances are configured to work in master <-> synchronous standby setup
>> and both run in docker containers with pgdata volume mounted from host.
>> When master is restarted the synchronous standby automatically takes the
>> role of master and master starts operating as synchronous replica.
>>
>> Everything works fine but whenever I restart master instance it creates a
>> new wal file in pg_xlog/ however old wal files are not cleaned up. They
>> pile up and in my reproduction environment, where there are no operations
>> on the database, they currently occupy:
>> du -sh pg_xlog/
>> 17G pg_xlog/
>>
>> and the number of the files is more or less:
>> ls pg_xlog/ | grep -v history | wc -l
>> 1024
>>
>> I've searched through the mailing lists and articles like
>> https://dataegret.com/2018/04/pg_wal-is-too-big-whats-going-on/ and the
>> main problems mentioned in most of the places are:
>> 1. failing archive command
>> 2. abandoned (inactive) replication slot
>> 3. issues with checkpoints
>> 3. too big wal_keep_segments value
>>
>> However, none of those problems seem to apply to my deployment:
>> 1. I have `archive_mode` set to `off`
>> 1a. I even tried enabling it and setting the `archive_command` to
>> '/bin/true' just to confirm a suggestion found in one of the post on the
>> mailing list (that didn't improve anything)
>> 2. the only replication slot in `pg_replication_slots` is the one related
>> to the synchronous replica and it is active
>> 3. I've enabled `log_checkpoint` but doesn't see any errors or warnings
>> related to checkpoints triggered either automatically or manually via
>> `CHECKPOINT;`
>> 4. `wal_keep_segments` is set to 8 and `max_wal_size` is set to 1GB
>>
>> Is there anything that I should check that could shed some light on this?
>>
>> A few configuration options taken from `pg_settings`:
>> archive_command = (disabled)
>> archive_mode = off
>> archive_timeout = 0
>> check_function_bodies = on
>> checkpoint_completion_target = 0.5
>> checkpoint_flush_after = 32
>> checkpoint_timeout = 300
>> hot_standby = on
>> hot_standby_feedback = off
>> max_replication_slots = 10
>> max_wal_senders = 10
>> max_wal_size = 64
>> min_wal_size = 5
>> synchronous_commit = on
>> synchronous_standby_names = patroni1
>> wal_block_size = 8192
>> wal_buffers = 512
>> wal_compression = off
>> wal_keep_segments = 8
>> wal_level = replica
>> wal_log_hints = on
>> wal_receiver_status_interval = 10
>> wal_receiver_timeout = 6
>> wal_retrieve_retry_interval = 5000
>> wal_segment_size = 2048
>> wal_sender_timeout = 6
>> wal_sync_method = fdatasync
>> wal_writer_delay = 200
>> wal_writer_flush_after = 128
>>
>> Kind regards
>>
>


Re: WAL files keep piling up

2021-12-22 Thread Tom Lane
Zbigniew Kostrzewa  writes:
> Thanks for responding. On current master it looks like so:
> ls pg_xlog/archive_status/ | grep ready | wc -l
> 0
> ls pg_xlog/archive_status/ | grep done | wc -l
> 501

Hmm, if you've got archiving turned off, I wonder why you have
any .done files at all.  Perhaps they are leftover from a time
when you did have archiving on, and for some reason they are
confusing the non-archive-mode cleanup logic.

Anyway, you could certainly manually remove the .done files and
the corresponding WAL segment files, and then see what happens.

BTW, I'm sure you realize that 9.6.15 is not exactly current.

regards, tom lane




Re: WAL files keep piling up

2021-12-22 Thread Zbigniew Kostrzewa
I've just checked my second reproduction cluster (also Patroni but this
time on K8s). It also has non-empty `archive_status/` directory:

ls pg_xlog/ | grep -v history | wc -l
165

ls pg_xlog/archive_status/ | wc -l
81
ls pg_xlog/archive_status/ | grep done | wc -l
81

but on this cluster I did not enable `archive_mode` at any time:

postgres=# select name,setting from pg_settings where name like 'archive_%';
  name   |  setting
-+
 archive_command | (disabled)
 archive_mode| off
 archive_timeout | 0

Yes, I am aware 9.6 is pretty old, soon I will be replacing it with 11.x.
Thanks.

Kind regards.


śr., 22 gru 2021 o 19:18 Tom Lane  napisał(a):

> Zbigniew Kostrzewa  writes:
> > Thanks for responding. On current master it looks like so:
> > ls pg_xlog/archive_status/ | grep ready | wc -l
> > 0
> > ls pg_xlog/archive_status/ | grep done | wc -l
> > 501
>
> Hmm, if you've got archiving turned off, I wonder why you have
> any .done files at all.  Perhaps they are leftover from a time
> when you did have archiving on, and for some reason they are
> confusing the non-archive-mode cleanup logic.
>
> Anyway, you could certainly manually remove the .done files and
> the corresponding WAL segment files, and then see what happens.
>
> BTW, I'm sure you realize that 9.6.15 is not exactly current.
>
> regards, tom lane
>