implicit transaction changes trigger behaviour

2019-08-29 Thread Willy-Bas Loos
Hi,

I currently have a fairly complex use case to solve, and one thing I tried
was a deferred constraint trigger. I'm not sure if this solution is the way
to go, but anyway: as I was testing my code, I noticed that the trigger
behaves differently depending on whether or not I explicitly use BEGIN and
COMMIT, even though there is only one query in the transaction.
I am wondering: is this a bug in PostgreSQL?
I'm using PostgreSQL 10.10 on Debian.

Here's an example that reproduces the behaviour:

/*
https://www.postgresql.org/docs/10/sql-createtrigger.html
Constraint triggers must be AFTER ROW triggers on plain tables (not foreign
tables). They can be fired either at the end of the statement causing the
triggering event, or at the end of the containing transaction; in the
latter case they are said to be deferred. A pending deferred-trigger firing
can also be forced to happen immediately by using SET CONSTRAINTS.
Constraint triggers are expected to raise an exception when the constraints
they implement are violated.
*/

create table a(a_id serial primary key);
create table b(b_id serial primary key, a_id integer not null, type integer not null);

create or replace function has_1b_type1() returns trigger as $$
declare
  n_b_type1 integer; -- the number of records in table b with type 1 that correspond to OLD.id
begin
  select count(*) into n_b_type1
  from b
  join a on b.a_id = a.a_id
  where b.type = 1;
  if n_b_type1 != 1 then
    raise exception 'Each record of a must have exactly 1 corresponding records in b of type 1. But after this delete the a-record with id % would have % b-records of type 1, so the operation has been cancelled.', OLD.a_id, n_b_type1;
  else
    -- The return value is ignored for row-level triggers fired after an operation, and so they can return NULL.
    return null;
  end if;
end
$$ language plpgsql stable;

create constraint trigger tr_has_1b_type1_del
after delete on b
deferrable initially deferred for each row
execute procedure has_1b_type1();

begin;
insert into a (a_id)
values(nextval('a_a_id_seq'));
insert into b(a_id, type)
values(currval('a_a_id_seq'), 1);
--also some other data, just to illustrate
insert into b(a_id, type)
values(currval('a_a_id_seq'), 2);
insert into b(a_id, type)
values(nextval('a_a_id_seq'), 3);
commit;

begin;
delete from b;
commit;
--ERROR:  Each record of a must have exactly 1 corresponding records in b of type 1. But after this delete the a-record with id 1 would have 0 b-records of type 1, so the operation has been cancelled.

delete from b;
--DELETE 3
--Query returned successfully in 91 msec.
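As the documentation excerpt above notes, a pending deferred-trigger firing can also be forced early with SET CONSTRAINTS. A sketch using the trigger defined above (not part of the original test, and not re-run against it):

```sql
begin;
-- make the deferred constraint trigger fire at the end of each statement
-- instead of at COMMIT
set constraints tr_has_1b_type1_del immediate;
delete from b;  -- the trigger now fires during the DELETE itself
commit;
```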


-- 
Willy-Bas Loos


Re: implicit transaction changes trigger behaviour

2019-09-03 Thread Willy-Bas Loos
Thank you so much, the "stable" thing was it.
I'm not sure if it is under-documented; I clearly didn't adhere to the rule
that a stable function "is guaranteed to return the same results given the
same arguments for all rows within a single statement".
BTW, in my example I made a mistake too, but that was beside the point
really.
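For the archive, a corrected version of the function — dropping the STABLE marking so the trigger sees the DELETE's effects, and actually filtering on OLD.a_id (the mistake mentioned above) — might look like this. A sketch, not re-tested against the original schema:

```sql
create or replace function has_1b_type1() returns trigger as $$
declare
  n_b_type1 integer;
begin
  select count(*) into n_b_type1
  from b
  where b.type = 1
    and b.a_id = OLD.a_id;  -- only count b-records of the affected a-record
  if n_b_type1 != 1 then
    raise exception 'a-record with id % would have % b-records of type 1',
      OLD.a_id, n_b_type1;
  end if;
  return null;  -- return value is ignored for AFTER row triggers
end
$$ language plpgsql;  -- default volatility (VOLATILE), so no stale snapshot
```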

Cheers,
Willy-Bas

On Thu, Aug 29, 2019 at 3:35 PM Tom Lane  wrote:

> Willy-Bas Loos  writes:
> > I currently have a fairly complex use case to solve and one thing i tried
> > was a deferred constraint trigger. I'm not sure if this solution is the
> way
> > to go, but anyway: As i was testing my code, i noticed that the trigger
> > behaves differently depending on whether or not i explicitly use BEGIN
> and
> > COMMIT, even though there is only 1 query in the transaction.
> > I am wondering if this is a bug in postgresql?
>
> I think the issue is that you marked the trigger as STABLE.  That causes
> it to use the calling query's snapshot so it doesn't see the updates,
> if it's fired during the delete query and not during the subsequent
> COMMIT.  If I remove the STABLE label then it works as you expect.
>
> This is probably under-documented but I'm not sure that it should be
> considered a bug.
>
> The trigger seems a bit broken besides that, in that the comments claim it
> has something to do with the OLD row's id field(s) but the query is not in
> fact taking that into account.
>
> regards, tom lane
>


-- 
Willy-Bas Loos


temporary data after diskspace error

2020-01-27 Thread Willy-Bas Loos
Hi,

We have a server with postgresql 9.4.12 on Ubuntu.
There has been a sudden rise in the amount of disk space used by
postgresql, causing a disk-space error:

2020-01-22 17:24:37 CET db: ip: us: PANIC:  could not write to file "pg_xlog/xlogtemp.23346": No space left on device
2020-01-22 17:24:37 CET db: ip: us: LOG:  WAL writer process (PID 23346) was terminated by signal 6: Aborted
2020-01-22 17:24:37 CET db: ip: us: LOG:  terminating any other active server processes

The disk was at roughly 75% before, and something or someone added 150 GB
to the database, bringing the disk space usage to 100%.
The query that got the initial error was creating a rather large table, but
it is not confirmed that this is the only source of the large-ish amount of
data. It is possible, though.

Now I can see in pg_stat_database and postgresql/9.4/main/base/pgsql_tmp
that there is 90 GB of temporary files in the database.

Could the amount of temp files be caused by the unfinished query? I'm not
sure how forceful signal 6 (SIGABRT) is exactly.

And also: how can I make postgres clean up the files?
Can it be done without restarting the cluster?
Will restarting it help?


-- 
Willy-Bas Loos


Re: temporary data after diskspace error

2020-01-27 Thread Willy-Bas Loos
Ok, thanks everyone!
Will there be a lot of downtime to delete those 90GB of temp files?

Will postgres just delete those files without processing them or should I
brace for some downtime?


On Mon, Jan 27, 2020 at 5:15 PM Tom Lane  wrote:

> Willy-Bas Loos  writes:
> > And also: How can i make postgres clean up the files?
> > Can it be done without restarting the cluster?
> > Will restarting it help?
>
> A restart will clean out temp files.  I don't think there's any
> terribly safe way to do it without that.  You could manually
> remove such files that haven't been accessed recently, but the
> risk of human error is high.
>
> regards, tom lane
>
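A sketch of the manual-removal route Tom mentions. The path is an assumption for a Debian-style 9.4/main install, and the 2-day cutoff is arbitrary; this only lists candidates — review carefully before deleting anything by hand on a live cluster:

```shell
# Cluster temp-file location for a Debian 9.4/main install (an assumption;
# adjust to your own data directory). Temp files are named pgsql_tmpPID.N.
PGTMP="${PGTMP:-/var/lib/postgresql/9.4/main/base/pgsql_tmp}"
if [ -d "$PGTMP" ]; then
    # list (do NOT yet delete) temp files untouched for 2+ days
    find "$PGTMP" -name 'pgsql_tmp*' -mtime +2 -ls
else
    echo "no such directory: $PGTMP"
fi
```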


Re: temporary data after diskspace error

2020-01-28 Thread Willy-Bas Loos
I did the restart and it took seconds.
This was on an SSD.
BTW, on Ubuntu and Debian I never use pg_ctl directly; postgresql-common
has a very nice CLI for this.
For the restart I used: sudo pg_ctlcluster --force 9.4 main restart

Thanks for all the good advice!

On Mon, Jan 27, 2020 at 10:15 PM Tom Lane  wrote:

> Willy-Bas Loos  writes:
> > Will there be a lot of downtime to delete those 90GB of temp files?
> > Will postgres just delete those files without processing them or should I
> > brace for some downtime?
>
> It's just a directory scan and an unlink() for each file that has
> the right filename pattern to be a temp file.  If you've got a
> reasonable filesystem I wouldn't expect it to take long.
> Maybe a minute? (but don't quote me on that)
>
>     regards, tom lane
>


-- 
Willy-Bas Loos


logical replication worker can't find postgis function

2022-04-22 Thread Willy-Bas Loos
Hi!

I'm using logical replication on postgresql 13. On the subscriber, there's
a trigger on a table that calculates the area of the geometry that's in
another column.
I enabled the trigger with
ALTER TABLE atable ENABLE ALWAYS TRIGGER atrigger;

But the logical replication worker can't find st_area:
2022-04-22 13:14:11.244 CEST [1932237] LOG:  logical replication apply worker for subscription "ba_acc1" has started
2022-04-22 13:14:11.282 CEST [1932237] ERROR:  function st_area(public.geometry) does not exist at character 14
2022-04-22 13:14:11.282 CEST [1932237] HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
2022-04-22 13:14:11.282 CEST [1932237] QUERY:  SELECT round(st_area(NEW.epsg28992_geom))
2022-04-22 13:14:11.282 CEST [1932237] CONTEXT:  PL/pgSQL function util.location_extras() line 3 at assignment
2022-04-22 13:14:11.285 CEST [1562110] LOG:  background worker "logical replication worker" (PID 1932237) exited with exit code 1

The trigger works well when I fire it in a normal update query.
How can this happen and how can I resolve this?

-- 
Willy-Bas Loos


Re: logical replication worker can't find postgis function

2022-04-22 Thread Willy-Bas Loos
On Fri, Apr 22, 2022 at 3:20 PM Laurenz Albe 
wrote:

>
> The trigger function is bad and dangerous, because it relies on the
> current setting of "search_path".
>
> You notice that with logical replication, because "search_path" is empty
> to avoid security problems.
>

Thanks a lot!
Do you mean that all trigger functions are bad and dangerous, or just mine?
Do you have any suggestions for an alternative?

Cheers,
-- 
Willy-Bas Loos


Re: logical replication worker can't find postgis function

2022-04-22 Thread Willy-Bas Loos
OK thanks for the help, have a nice weekend!

On Fri, Apr 22, 2022 at 3:39 PM Laurenz Albe 
wrote:

> On Fri, 2022-04-22 at 15:26 +0200, Willy-Bas Loos wrote:
> > On Fri, Apr 22, 2022 at 3:20 PM Laurenz Albe 
> wrote:
> > >
> > > The trigger function is bad and dangerous, because it relies on the
> current setting of "search_path".
> > >
> > > You notice that with logical replication, because "search_path" is
> empty to avoid security problems.
> >
> > Thanks a lot!
> > Do you mean that all trigger functions are bad and dangerous, or just
> mine?
> > Do you have any suggestions for an alternative?
>
> There is nothing wrong per se with using trigger functions.
>
> But, to attempt a generic statement, any function that fails if you change
> "search_path"
> is a potential problem.
>
> If your application makes sure that "search_path" is always set correctly,
> the problem
> is smaller.
>
> If highly privileged processes call the function, the problem becomes
> worse, because the
> potential damage is bigger.
>
> The best way to make sure nothing can happen is to create all functions
> with a
> hard-wired "search_path".  Then nothing can go wrong.
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
>
>
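Laurenz's last point can be applied directly to the failing function from this thread. A sketch — the function name util.location_extras and the column name are taken from the error messages quoted earlier, and the exact schema list is an assumption:

```sql
-- Pin a search_path on the function itself, so it resolves public.st_area
-- regardless of what the calling session (e.g. a replication worker) has set.
ALTER FUNCTION util.location_extras()
  SET search_path = public, pg_temp;

-- Alternative: schema-qualify the call in the function body instead,
-- e.g.  round(public.st_area(NEW.epsg28992_geom))
```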

-- 
Willy-Bas Loos


tcp keepalives not sent during long query

2022-12-13 Thread Willy-Bas Loos
Hi!

Some users of our database have a NAT firewall and keep a postgres client
(e.g. pgAdmin) open for hours. To prevent the connection from being killed
by the firewall due to inactivity, we configured tcp_keepalives_idle = 120
so that the server sends keepalives and keeps the connection active. (This
is on Debian.)
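For context, these are the related server-side keepalive settings in postgresql.conf (only the first one was changed here; the values shown for the other two are the defaults, where 0 means "use the operating system's default"):

```
# postgresql.conf: server-side TCP keepalive settings
tcp_keepalives_idle = 120      # seconds of inactivity before first keepalive
tcp_keepalives_interval = 0    # 0 = OS default interval between keepalives
tcp_keepalives_count = 0       # 0 = OS default number of lost keepalives
```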

We've recently upgraded from postgres 9.3 to 13 and from Debian 6 to 11.
I'm getting the complaint that since the upgrade, the connection breaks,
but only when they run a long query.

Has anything changed in postgres that might cause this? e.g. that
keepalives are only sent when the session is idle?

Thanks
-- 
Willy-Bas Loos


Re: tcp keepalives not sent during long query

2022-12-14 Thread Willy-Bas Loos
Thanks for your answer. I was afraid someone would say that...
I was hoping that the keepalives would be more of a matter of cooperation
between postgres and the OS.


On Wed, Dec 14, 2022 at 10:52 AM Laurenz Albe 
wrote:

> On Wed, 2022-12-14 at 08:55 +0100, Willy-Bas Loos wrote:
> > Some users of our database have a NAT firewall and keep a postgres
> client (e.g. pgAdmin )
> > open for hours. To prevent the connection from being killed by the
> firewall due to inactivity,
> > we configured tcp_keepalives_idle = 120 so that the server sends
> keepalives and keeps the
> > connection active. (this is on debian)
> >
> > We've recently upgraded from postgres 9.3 to 13 and from debian 6 to 11.
> > I'm getting the complaint that since the upgrade, the connection breaks.
> But only when they run a long query.
> >
> > Has anything changed in postgres that might cause this? e.g. that
> keepalives are only sent when the session is idle?
>
> It is the operating system kernel that sends keepalives, so that should be
> independent of
> what the PostgreSQL backend is doing.
>
> Yours,
> Laurenz Albe
>


-- 
Willy-Bas Loos


Re: tcp keepalives not sent during long query

2022-12-14 Thread Willy-Bas Loos
On Wed, Dec 14, 2022 at 6:38 PM Tom Lane  wrote:

> It'd be worth doing
>
> show tcp_keepalives_idle;
>
Wow, you're right! It's in the postgresql.conf but it isn't set when I
reload the server.
A restart also doesn't do it, and even doing SET tcp_keepalives_idle=120;
doesn't work.
It gives me a confirmation, but then when I SHOW the value, it gives me 0.

wbloos=# set tcp_keepalives_idle=120;
SET
wbloos=# show tcp_keepalives_idle;
 tcp_keepalives_idle
-----
 0
(1 row)

-- 
Willy-Bas Loos


Re: tcp keepalives not sent during long query

2022-12-14 Thread Willy-Bas Loos
The version is PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110,
64-bit


Re: tcp keepalives not sent during long query

2022-12-15 Thread Willy-Bas Loos
Nice query, I keep learning new stuff here.
Anyway, that shows the correct line (80) in the config file, but the wrong
value: namely 0, where the config file has 120.

On Thu, Dec 15, 2022 at 12:37 PM Laurenz Albe 
wrote:

> On Thu, 2022-12-15 at 08:31 +0100, Willy-Bas Loos wrote:
> > On Wed, Dec 14, 2022 at 6:38 PM Tom Lane  wrote:
> > > It'd be worth doing
> > >
> > > show tcp_keepalives_idle;
> >
> > Wow, you're right! It's in the postgresql.conf but it isn't set when I
> reload the server
> > A restart also doesn't do it and even doing SET tcp_keepalives_idle=120;
> doesn't work.
> > It gives me a confirmation, but then when I SHOW the value, it gives me
> 0.
> >
> > wbloos=# set tcp_keepalives_idle=120;
> > SET
> > wbloos=# show tcp_keepalives_idle;
> >  tcp_keepalives_idle
> > -
> >  0
> > (1 row)
>
> One good way to debug this is
>
>   SELECT setting, source, sourcefile, sourceline
>   FROM pg_settings
>   WHERE name = 'tcp_keepalives_idle';
>
> That will tell you from where you get the parameter value.
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
>


-- 
Willy-Bas Loos


Re: tcp keepalives not sent during long query

2022-12-15 Thread Willy-Bas Loos
On Thu, Dec 15, 2022 at 2:04 PM Geoff Winkless  wrote:

>
> Are you connected in this psql session via tcp or unix domain socket?
>
Right, got me again. That was a Unix-domain socket.
When I do SHOW tcp_keepalives_idle; from pgAdmin it shows me 120, which is
correct.
Thanks for clarifying that.
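For reference, a quick way to check from any session whether it is connected over TCP or a Unix-domain socket (inet_server_addr() returns NULL for socket connections, which is also why the tcp_* settings read as 0 there):

```sql
-- NULL server address means the session came in over a Unix-domain socket.
SELECT coalesce(inet_server_addr()::text, 'unix-domain socket') AS connected_via;
```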

So that means I still don't know why the connections are breaking.
I know that this could be anything, in any case not due to the postgres
server.

Our ISP has inspected the network traffic and indeed found empty TCP ACK
packets being sent back and forth to/from the user's IP, supposedly
keepalives.
I contacted the user and double-checked their statement that they only have
the issue when running long queries. It turns out that this is not the
case: the connection also breaks on idle query windows, but then they just
reconnect, so it's not a problem.

The user now indicated that they can work around the issue by creating a
table as a result, instead of simply selecting the data to be displayed in
the client.
So we decided to cease our efforts to fix the issue.
Thanks a lot for your help!

-- 
Willy-Bas Loos


Re: tcp keepalives not sent during long query

2022-12-15 Thread Willy-Bas Loos
Yes exactly, Geoff Winkless pointed that out too.
I thought I'd found a cause for the breaking connections, but I hadn't.
Thanks a lot for your help!


On Thu, Dec 15, 2022 at 3:48 PM Tom Lane  wrote:

> Willy-Bas Loos  writes:
> > It gives me a confirmation, but then when I SHOW the value, it gives me
> 0.
>
> > wbloos=# set tcp_keepalives_idle=120;
> > SET
> > wbloos=# show tcp_keepalives_idle;
> >  tcp_keepalives_idle
> > -
> >  0
> > (1 row)
>
> That's the behavior I'd expect on a local (Unix-socket) connection
> ... you sure you're doing this from one of the problematic clients?
>
> regards, tom lane
>


-- 
Willy-Bas Loos


log level of "drop cascade" lists

2019-01-10 Thread Willy-Bas Loos
Hi,

(this is in version 9.4)
The SQL command DROP schema myschema CASCADE tells me that the full list of
items that the drop cascades to is in the log, but it isn't.

messages on stdout:
...
drop cascades to table myschema.mytable
and 143 other objects (see server log for list)
DROP SCHEMA

The log doesn't mention this at all, except
2019-01-10 12:10:45 CET ERROR:  canceling autovacuum task
2019-01-10 12:10:45 CET CONTEXT:  automatic analyze of table
"myschema.mytable"

log_min_messages is on the default value, which is warning.
#log_min_messages = warning

At first glance, it seems logical that the list of dropped items is a
"notice". But now that it seems that the cascade went further than I
anticipated, it is of greater significance to me than that.
Also, truncating the list in the message and referring to the log is not
desirable IMHO if the default setting is to not log the list.

So long story short: I think it would be wise to set the log level of
"drop cascade" lists to "warning".

Cheers,
-- 
Willy-Bas Loos


Re: log level of "drop cascade" lists

2019-01-11 Thread Willy-Bas Loos
On Thu, Jan 10, 2019 at 4:44 PM Adrian Klaver 
wrote:

>
> 1) BEGIN;
> DROP schema myschema CASCADE;
> ROLLBACK/COMMIT;
>
> 2) \d myschema.*


On Thu, Jan 10, 2019 at 5:04 PM Tom Lane  wrote:

> I think that would be met with more complaints than kudos.
> "WARNING" is supposed to mean "there's probably something wrong here",
> and a report of a cascaded drop is not that.
>

OK, both are good points.
Since the list is truncated and possibly affects objects in other schemas,
I would recommend setting
SET log_min_messages = notice;
for that session (for anyone else reading this: no need to set it in the
settings file, the above is an SQL command).
Then it is possible to view the full list in the log (e.g. after rolling
back the transaction with the drop query).
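Put together as a sketch (note that log_min_messages can only be SET by superusers, and the schema name is the placeholder from the original message):

```sql
-- Superuser session: raise the log level, do a dry run, inspect the log.
SET log_min_messages = notice;
BEGIN;
DROP SCHEMA myschema CASCADE;  -- full cascade list now goes to the server log
ROLLBACK;                      -- nothing actually dropped yet
```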

Cheers,
-- 
Willy-Bas Loos


WAL accumulating, Logical Replication pg 13

2021-05-28 Thread Willy-Bas Loos
Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a
problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
and my disks are getting full. The oldest WAL file is 18 days old.
I use Logical Replication from the new cluster to another new cluster with
1 subscriber and 1 subscription.

pg_stat_subscription tells me all recent timestamps.
and this:
db=# select * from pg_replication_slots;
-[ RECORD 1 ]---+-
slot_name   | my_pub1
plugin  | pgoutput
slot_type   | logical
datoid  | 16401
database| db
temporary   | f
active  | t
active_pid  | 9480
xmin|
catalog_xmin| 269168
restart_lsn | D4/908BC268
confirmed_flush_lsn | E1/25BF5710
wal_status  | extended
safe_wal_size   |



I've had problems with disk space on this server, with postgres crashing
because of it; I then added more disk space and postgres recovered. This
doesn't seem to be a problem now.

The *publication* has the options publish = 'insert, update, delete,
truncate', publish_via_partition_root = false
The *subscription* has the options connect = true, enabled = true,
create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'

The log on the publisher says:
2021-05-25 21:25:18.973 CEST [4584] user@db LOG:  starting logical decoding for slot "my_pub1"
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL:  Streaming transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
2021-05-25 21:25:18.973 CEST [4584] user@db LOG:  logical decoding found consistent point at D4/908BC268
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL:  There are no running transactions.
2021-05-25 21:29:49.456 CEST [4614] user@db ERROR:  replication slot "my_pub1" is active for PID 4584
2021-05-25 21:29:54.474 CEST [4615] user@db ERROR:  replication slot "my_pub1" is active for PID 4584

And on the subscriber:
2021-05-28 21:23:46.702 CEST [40039] LOG:  logical replication apply worker for subscription "my_pub1" has started
2021-05-28 21:23:46.712 CEST [40039] ERROR:  could not start WAL streaming: ERROR:  replication slot "my_pub1" is active for PID 730
2021-05-28 21:23:46.714 CEST [19794] LOG:  background worker "logical replication worker" (PID 40039) exited with exit code 1

The postgres settings on the *publisher* are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 50GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 1GB
min_wal_size = 80MB
#archive_mode = off
max_wal_senders = 10 # max number of walsender processes
wal_sender_timeout = 1min # in milliseconds; 0 disables
max_replication_slots = 7 # max number of replication slots

On postgres settings on the *subscriber*:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 25GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 3GB
min_wal_size = 80MB
#archive_mode = off
wal_receiver_timeout = 1min # time that receiver waits for
max_logical_replication_workers = 10 # taken from max_worker_processes
max_sync_workers_per_subscription = 5 # taken from
max_logical_replication_workers

I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10
minutes each, but this had no positive effect.
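One PostgreSQL 13 setting absent from the lists above and directly relevant here: max_slot_wal_keep_size caps how much WAL a replication slot may retain. With the default of -1 retention is unbounded, and safe_wal_size in pg_replication_slots shows NULL, matching the slot output above. A postgresql.conf sketch with an assumed limit — note that a slot exceeding it is invalidated rather than allowed to fill the disk:

```
# postgresql.conf (PostgreSQL 13+): cap WAL retained by replication slots
max_slot_wal_keep_size = 200GB   # default -1 = no limit
```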

Some advice would be helpful
-- 
Willy-Bas Loos


Re: WAL accumulating, Logical Replication pg 13

2021-05-29 Thread Willy-Bas Loos
Yeah, indexes could slow things down, thanks. BTW, I'm not using logical
replication for the upgrade; that's not supported for 9.3.
It was more complicated, but that's beside the point.

I could just delete the publication and all that belongs to it and start
over. But since I'm trying out logical replication, I would like to be more
in control than that. Is there anything that I can dig into to find out
why the WAL is accumulating?

On Fri, May 28, 2021 at 10:20 PM Vijaykumar Jain <
vijaykumarjain.git...@gmail.com>:

> I am not too sure with 9.3
> i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
> extension)
>
> one thing to note.
> logical replication initiates a copy from a snapshot, then changes from
> then on.
>
> I had a very high insert rate on my source tables (v9.6) and the
> destination (v11) could not keep up (it had tons of indexes when I copied
> the schema) and it took around a day as the table had around 12 indexes.
>
> So at the destination(v11), I dropped all but the primary index for each
> table, started subscription and when it was almost caught up, rebuilt the
> index on the destination concurrently.
> it completed in 4-5 hours without stopping the source.
> migration completed in a few mins :)
>
> not sure if this would help, but just FYI.
>
>
> On Sat, 29 May 2021 at 01:36, Willy-Bas Loos  wrote:
>
>> Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
>> Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a
>> problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
>> and my disks are getting full. The oldest WAL file is 18 days old.
>> I use Logical Replication from the new cluster to another new cluster
>> with 1 subscriber and 1 subscription.
>>
>> pg_stat_subscription tells me all recent timestamps.
>> and this:
>> db=# select * from pg_replication_slots;
>> -[ RECORD 1 ]---+-
>> slot_name   | my_pub1
>> plugin  | pgoutput
>> slot_type   | logical
>> datoid  | 16401
>> database| db
>> temporary   | f
>> active  | t
>> active_pid  | 9480
>> xmin|
>> catalog_xmin| 269168
>> restart_lsn | D4/908BC268
>> confirmed_flush_lsn | E1/25BF5710
>> wal_status  | extended
>> safe_wal_size   |
>>
>>
>>
>> I've had problems with diskspace on this server, with postgres crashing
>> because of it, then added more diskspace and postgres recovered. This
>> doesn't seem to be a problem now.
>>
>> The *publication* has the options publish = 'insert, update, delete,
>> truncate', publish_via_partition_root = false
>> The *subscription* has the options connect = true, enabled = true,
>> create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'
>>
>> The log on the publisher says:
>> 2021-05-25 21:25:18.973 CEST [4584] user@db LOG:  starting logical
>> decoding for slot "my_pub1"
>> 2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL:  Streaming
>> transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
>> 2021-05-25 21:25:18.973 CEST [4584] user@db LOG:  logical decoding found
>> consistent point at D4/908BC268
>> 2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL:  There are no
>> running transactions.
>> 2021-05-25 21:29:49.456 CEST [4614] user@db ERROR:  replication slot
>> "my_pub1" is active for PID 4584
>> 2021-05-25 21:29:54.474 CEST [4615] user@db ERROR:  replication slot
>> "my_pub1" is active for PID 4584
>>
>> And on the subscriber:
>> 2021-05-28 21:23:46.702 CEST [40039] LOG:  logical replication apply
>> worker for subscription "my_pub1" has started
>> 2021-05-28 21:23:46.712 CEST [40039] ERROR:  could not start WAL
>> streaming: ERROR:  replication slot "my_pub1" is active for PID 730
>> 2021-05-28 21:23:46.714 CEST [19794] LOG:  background worker "logical
>> replication worker" (PID 40039) exited with exit code 1
>>
>> The postgres settings on the *publisher* are:
>> max_connections = 100 # (change requires restart)
>> tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
>> shared_buffers = 50GB # min 128kB
>> work_mem = 1GB # min 64kB
>> maintenance_work_mem = 10GB # min 1MB
>> logical_decoding_work_mem = 5GB # min 64kB
>> dynamic_shared_memory_type = posix # the default is the first option
>> max_worker_processes = 20 # (change requires resta

Re: WAL accumulating, Logical Replication pg 13

2021-05-31 Thread Willy-Bas Loos
Thank you for elaborating those possible causes and for the suggestions you
made.
1) if you have an inactive replication slot.
There is only 1 replication slot and it is active. So that is not the issue.

2) Do you have archiving enabled?
No, I never turned it on, so this is in the settings of both publisher
and subscriber: #archive_mode = off (and SHOW archive_mode; tells me the
same).

3) logical replication can be broken for multiple reasons, like conflicts
where the subscriber already has the data which primary wants to push. it
will not proceed until the conflicts are resolved.
That would have been in the log, but there isn't any conflict in the log.
Only the messages that i posted with my first message.

4) poor connectivity or the computer/network resources not able to keep up
with the load, can result in WAL pile up.
This would be strange, since there is a 10 Gbps connection within the same
rack. But it could theoretically be malfunctioning, or the performance on
the subscriber could be too low.
If any of this is the case, shouldn't that be visible in
pg_stat_subscription?

Thanks for the article, it's interesting to see how they transitioned from
londiste, even if the article is about pglogical, not logical replication
in the postgres core.
I was using Londiste to transfer the data to the new server and minimize
downtime, so the article might come in handy.
I prepared by reading the documentation, which is very straightforward.
> btw,
> how are you doing logical replication with 9.3? Using a pglogical
extension?
No, I'm not using logical replication in postgres 9.3 . Only on postgres 13.
About the link to the bug reports: Thanks for the suggestion. But first I'd
like to get some better grip on what is going on before searching for bugs.

Still, any help will be much appreciated

On Sat, May 29, 2021 at 5:16 PM Vijaykumar Jain <
vijaykumarjain.git...@gmail.com> wrote:

> WAL can be built up for reasons like
> 1) if you have an inactive replication slot. I mean you had a streaming
> replica which was turned off, but you did not remove the slot from primary.
> 2) Do you have archiving enabled? Are the archiving commands running fine
> ? if just the archiving is broken, then you can manually run
> archive cleanup provided, replication is all caught up fine.
>
> 3) logical replication can be broken for multiple reasons, like conflicts
> where the subscriber already has the data which primary wants to push. it
> will not proceed until the conflicts are resolved.
> 4) poor connectivity or the computer/network resources not able to keep up
> with the load, can result in WAL pile up.
>
> there are many blogs around logical replication issues, but when it was
> new in pg10, I read this.
> Recovery use cases for Logical Replication in PostgreSQL 10 | by
> Konstantin Evteev | AvitoTech | Medium
> <https://medium.com/avitotech/recovery-use-cases-for-logical-replication-in-postgresql-10-a1e6bab03072>
>
> btw,
> how are you doing logical replication with 9.3 ? using a pglogical
> extension ?
> I can try many things, but it would be wrong to make assumptions since i
> did not work with 9.3
> for ex.
> Bug fix: Using ExecCopySlot during multi insert by bdrouvotAWS · Pull
> Request #295 · 2ndQuadrant/pglogical (github.com)
> <https://github.com/2ndQuadrant/pglogical/pull/295>
> there are many issues posted here that may be relevant to your setup.
>
>
>
>
>
> On Sat, 29 May 2021 at 19:22, Willy-Bas Loos  wrote:
>
>> Yeah, indexes could slow things down, thanks. Btw I'm not using logical
>> replication for the upgrade, that's not supported for 9.3.
>> It was more complicated but that's beside the point.
>>
>> I could just delete the publication and all that belongs to it and start
>> over. But since I'm trying out logical replication, I would like to be more
>> in control than that. It's there anything that I can dig into to find out
>> why the WAL is accumulating?
>>
>> On Fri, May 28, 2021 at 10:20 PM Vijaykumar Jain <
>> vijaykumarjain.git...@gmail.com>:
>>
>>> I am not too sure with 9.3
>>> i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
>>> extension)
>>>
>>> one thing to note.
>>> logical replication initiates a copy from a snapshot, then changes from
>>> then on.
>>>
>>> I had a very high insert rate on my source tables (v9.6) and the
>>> destination (v11) could not keep up (it had tons of indexes when I copied
>>> the schema) and it took around a day as the table had around 12 indexes.
>>>
>>> So at the destination(v11), I dropped all but the primary index for each
>>> table, started subscription

Re: WAL accumulating, Logical Replication pg 13

2021-05-31 Thread Willy-Bas Loos
On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain <
vijaykumarjain.git...@gmail.com> wrote:

> So I got it all wrong it seems :)
>
Thank you for taking the time to help me!

You upgraded to pg13 fine? , but while on pg13 you have issues with logical
> replication ?
>
Yes, the upgrade went fine. So here are some details:
I already had londiste running on postgres 9.3, but londiste wouldn't run
on Debian 10.
So I first made the new server Debian 9 with postgres 9.6, and I started
replicating with londiste from 9.3 to 9.6.
When all was ready, I stopped the replication to the 9.6 server and deleted
all londiste & pgq content with drop schema cascade.
Then I upgraded the server to Debian 10. Then I used pg_upgrade to upgrade
from postgres 9.6 to 13. (PostGIS versions were kept compatible.)
Then I added logical replication and a third server as a subscriber.

I was going to write that replication is working fine (since the table
contains a lot of data and there are no conflicts in the log), but it turns
out that it isn't.
The subscriber is behind, and it looks like there hasn't been any incoming
data after the initial data synchronization.
So at least now I know that the WAL is being retained for a reason. The
connection is working properly (via psql, anyway).

I will also look into how to diagnose this from the system tables, e.g.
subtracting LSNs to get some quantitative measure for the lag.
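
For example, a minimal sketch of that kind of diagnosis on the publisher
(the column names come from pg_replication_slots; the lag labels are my own):

-- On the publisher: per-slot retention lag in bytes (a sketch).
-- restart_lsn is the oldest WAL the slot still holds on to;
-- confirmed_flush_lsn is what the subscriber has confirmed.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirm_lag
FROM pg_replication_slots
WHERE slot_type = 'logical';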



> There is a path in the postgresql source user subscription folder iirc
> which covers various logical replication scenarios.
> That may help you just in case.
>
OK, so comments in the source code you mean?


Re: WAL accumulating, Logical Replication pg 13

2021-06-11 Thread Willy-Bas Loos
Hi, I was going to follow up on this one, sorry for the long silence.
The replication is working fine now, and I have no idea what the problem
was. Not cool.
If I find out, I will let you know.

On Mon, May 31, 2021 at 6:06 PM Tomas Pospisek  wrote:

> Hi Willy-Bas Loos,
>
> On 31.05.21 17:32, Willy-Bas Loos wrote:
> >
> >
> > On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain
> >  > <mailto:vijaykumarjain.git...@gmail.com>> wrote:
> >
> > So I got it all wrong it seems :)
> >
> > Thank you for taking the time to help me!
> >
> > You upgraded to pg13 fine? , but while on pg13 you have issues with
> > logical replication ?
> >
> > Yes, the upgrade went fine. So here are some details:
> > I already had londiste running on postgres 9.3, but londiste wouldn't
> > run on Debian 10
> > So i first made the new server Debian 9 with postgres 9.6 and i started
> > replicating with londiste from 9.3 to 9.6
> > When all was ready, i stopped the replication to the 9.6 server and
> > deleted all londiste & pgq content with drop schema cascade.
> > Then I upgraded the server to Debian  10. Then i user pg_upgrade to
> > upgrade from postgres 9.6 to 13. (PostGIS versions were kept compatible).
> > Then I added logical replication and a third server as a subscriber.
> >
> > I was going to write that replication is working fine (since the table
> > contains a lot of data and there are no conflicts in the log), but it
> > turns out that it isn't.
> > The subscriber is behind and It looks like there hasn't been any
> > incoming data after the initial data synchronization.
> > So at least now i know that the WAL is being retained with a reason. The
> > connection is working properly (via psql anyway)
>
> I once maybe had a similar problem due to some ports that were needed
> for replication being firewalled off or respectively the master having
> the wrong IP address of the old master (now standby server) or such.
>
> There was absolutely no word anywhere in any log about the problem I was
> just seeing the new postgres master not starting up after hours and
> hours of waiting after a failover. I somehow found out about the
> required port being blocked (I don't remember - maybe seing the
> unanswered SYNs in tcpdump? Or via ufw log entries?).
>
> > I will also look into how to diagnose this from the system tables, e.g.
> > substracting LSN's to get some quantitative measure  for the lag.
> >
> >
> >
> > There is a path in the postgresql source user subscription folder
> > iirc which covers various logical replication scenarios.
> > That may help you just in case.
> >
> > OK, so comments in the source code you mean?
> >
>
>

-- 
Willy-Bas Loos


Re: WAL accumulating, Logical Replication pg 13

2021-07-15 Thread Willy-Bas Loos
Hi, here's an update on this old thread.
I've found the problem, it was because of long transactions.
I reproduced the problem when I added a second publication and another
postgres cluster with a subscription to it (on acceptance still).

This query came in handy to show me what was happening:

select r.pid, a.state as query_state, a.wait_event, r.application_name,
       r.state as replication_state,
       -- now() - a.query_start as query_time,
       r.write_lag,
       pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn, r.write_lsn)) as write_lag2,
       r.flush_lag,
       pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn, r.flush_lsn)) as flush_lag2,
       r.replay_lag,
       pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn)) as replay_lag2,
       r.client_addr, a.backend_start, a.state_change, a.query_start,
       now()
from pg_stat_replication r
left join pg_stat_activity a on a.pid = r.pid
order by r.application_name;

What I saw was that replication would "start", then enter a "catchup" state
and detect about 5GB of data, which it would then process, so I would see
the "lag" numbers decline as I repeated that query.
When the number hit 0 or a slightly negative value, it would take a minute
(literally) and then the process was repeated. BTW I was seeing negative
numbers at some point, so I swapped the argument order (e.g. r.write_lsn,
r.sent_lsn) and changed it back later. So I think that the 5GB was
actually -5GB, strangely. During the minute-long wait at near-zero lag, the
wait_event would be "ReorderBufferWrite".

In the log there was an important extra line that I hadn't noticed before,
because it doesn't occur as often as all the lines about the workers
exiting with an error:
2021-07-14 14:04:58.110 CEST [22782] ERROR: terminating logical replication worker due to timeout
2021-07-14 14:04:58.112 CEST [3720] LOG: background worker "logical replication worker" (PID 22782) exited with exit code 1

I was already suspecting long transactions to be the problem, so I thought
that the timeout might be the cause.
At that point I changed the wal_receiver_timeout option to 0 in the
postgresql.conf file and the above behaviour ceased.
Instead I saw the lag values increase rapidly, by about 100GB per hour
(not a high-traffic database).
There were long transactions on the subscriber database, but merely killing
those processes wasn't enough because they would soon be started again.
I had to disable the job that started them for about 15 minutes; that did
the trick. After that, those long transactions were not a problem anymore.
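
A query along these lines can surface such long transactions on the
subscriber (a sketch against pg_stat_activity; the 5-minute threshold is an
arbitrary example):

-- On the subscriber: sessions whose transaction has been open a long time.
-- The 5-minute threshold is an arbitrary illustration.
SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start;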

A guess at what is causing this:
The long transactions on the subscriber node are only a problem during the
initial snapshot phase, which uses a transaction and needs heavy locks on
the tables that are to be replicated.
The initial snapshot was cancelled because it didn't finish within
wal_receiver_timeout, so it was restarted later.
When I disabled the timeout, all the WAL on the server was retained. There
were still periodic retries, but no pending request for a lock, or it would
have been granted the moment I killed that long transaction.
But every time the snapshot process was retried, there was already a new
long transaction preventing the lock for the snapshot.
I had to prevent long transactions for long enough that the initial
snapshot could get a lock on the subscriber. However, I don't know the
length of the timeout that defines how long a transaction can run without
disturbing the snapshot.
Please correct me where I'm wrong.
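
For what it's worth, the per-table state of that initial copy can be
watched on the subscriber; pg_subscription_rel is the catalog that tracks
it (a sketch):

-- On the subscriber: per-table sync state of the initial copy.
-- srsubstate: i = initialize, d = data is being copied,
--             s = synchronized, r = ready (normal replication)
SELECT s.subname, r.srrelid::regclass AS table_name, r.srsubstate
FROM pg_subscription_rel r
JOIN pg_subscription s ON s.oid = r.srsubid;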

Cheers,

Willy Loos


On Fri, Jun 11, 2021 at 9:51 AM Willy-Bas Loos  wrote:

> Hi, I was going to follow up on this one, sorry for the long silence.
> The replication is working fine now, and I have no idea what the problem
> was. Not cool.
> If I find out, I will let you know.
>
> On Mon, May 31, 2021 at 6:06 PM Tomas Pospisek  wrote:
>
>> Hi Willy-Bas Loos,
>>
>> On 31.05.21 17:32, Willy-Bas Loos wrote:
>> >
>> >
>> > On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain
>> > > > <mailto:vijaykumarjain.git...@gmail.com>> wrote:
>> >
>> > So I got it all wrong it seems :)
>> >
>> > Thank you for taking the time to help me!
>> >
>> > You upgraded to pg13 fine? , but while on pg13 you have issues with
>> > logical replication ?
>> >
>> > Yes, the upgrade went fine. So here are some details:
>> > I already had londiste running on postgres 9.3, but londiste wouldn't
>> > run on Debian 10
>> > So i first made the new server Debian 9 with postgres 9.6 and i started
>> > replicating with londiste from 9.3 to 9.6
>> > When all was ready, i stopped the replication to the 9.6 server and
>> > deleted all londiste & pgq content with drop schema cascade.

Re: find replication slots that "belong" to a publication

2025-04-07 Thread Willy-Bas Loos
Hi Justin, thanks for your answer!
My question is not so much about "can i drop a certain replication slot",
more about "does this publication still have any replication slots?". Or,
if you will: "what's the publication for this replication slot?".

I've double checked the views that you suggested, and I found that I can
relate the WAL sender processes to replication slots through
pg_replication_slots.active_pid .
I've also looked into replication origins.

But I can't find a link to the publication. And that's what I need to know.
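
For reference, the active_pid link I mean can be expressed as a query like
this (a sketch; it shows the slot-to-walsender link, but, as discussed, not
the publication):

-- Relate logical replication slots to their WAL sender processes.
-- This still says nothing about which publication the slot serves.
SELECT s.slot_name, s.plugin, s.active_pid,
       r.application_name, r.state, r.client_addr
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid
WHERE s.slot_type = 'logical';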

Cheers,
Willy-Bas

On Sun, Apr 6, 2025 at 3:36 PM Justin  wrote:

> On Fri, Apr 4, 2025 at 4:58 AM Willy-Bas Loos  wrote:
>
>> Hi!
>>
>> I'm looking for a way to find out if there are still replication slots
>> active for a publication before dropping the publication in an automated
>> way. The idea is that the publication is thought not to be needed any
>> longer, but we want to make sure.
>>
>> I'm having trouble finding a link between a publication, the
>> subscriptions and the replication slots. Especially when you don't want to
>> make assumptions about any subscriber nodes, so you are restricted to the
>> publisher node.
>>
>> The best I could find was a query listed in pg_stat_activity that lists
>> the slot name and the publication name:
>> START_REPLICATION SLOT "my_slot" LOGICAL 5DD1/3E56D360 (proto_version
>> '1', publication_names '"my_publication"')
>>
>> I don't like the idea of using string manipulation on such query strings
>> to get the information I need. Postgres must have a way to compose this
>> query.
>> Can anyone tell me a way to find replication slots that belong to a
>> publication?
>>
>> --
>> Willy-Bas Loos
>>
>
> Hi Willy-Bas,
>
> Logical replication slots appear in the views pg_stat_replication and
> pg_replication_slots.  Both views have the information you are looking for,
> the difference is pg_stat_replication shows only the active slots.  Keep in
> mind Temporary Slots only live for the length of the session that created
> it; the slot will appear in both views.
>
> The bigger issue I think you are trying to address is when can a slot be
> dropped safely.  Once a logical replication slot is dropped there is no
> recovery of the slot's lsn position. Probably the best way to decide if a
> slot has been abandoned is how far behind it is. The pg_wal_lsn_diff  can
> be used to figure out how far behind a slot is
>
> https://www.postgresql.org/docs/13/view-pg-replication-slots.html
>
> https://www.postgresql.org/docs/13/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-VIEW
>
>
> https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-BACKUP
>
> Hope this answers your question
>
>
> Justin
>


-- 
Willy-Bas Loos


Re: find replication slots that "belong" to a publication

2025-04-10 Thread Willy-Bas Loos
Hi Laurenz,

Thanks for answering!
I find it very strange, because the publication is needed to make a
subscription, which makes the slot.
Thanks for looking into it and helping me understand.

Cheers!
Willy-Bas Loos


On Mon, Apr 7, 2025 at 3:31 PM Laurenz Albe 
wrote:

> On Mon, 2025-04-07 at 12:16 +0200, Willy-Bas Loos wrote:
> > My question is not so much about "can i drop a certain replication
> slot",
> > more about "does this publication still have any replication slots?".
> > Or, if you will: "what's the publication for this replication slot?".
> >
> > I've double checked the views that you suggested, and I found that I can
> relate
> > the WAL sender processes to replication slots through
> pg_replication_slots.active_pid .
> > I've also looked into replication origins.
> >
> > But I can't find a link to the publication. And that's what I need to
> know.
>
> I don't think that there is a connection between a publication and a
> replication slot.  That connection is only made when a subscriber connects
> and runs the START_REPLICATION command [1] and specifies the "pgoutput"
> plugin with the "publication_names" option [2].
>
> I don't think you can see that information reflected in a system view
> on the primary.  You'd have to query "pg_subscription" on the standby.
>
> Yours,
> Laurenz Albe
>
>
>  [1]:
> https://www.postgresql.org/docs/current/protocol-replication.html#PROTOCOL-REPLICATION-START-REPLICATION-SLOT-LOGICAL
>  [2]:
> https://www.postgresql.org/docs/current/protocol-logical-replication.html#PROTOCOL-LOGICAL-REPLICATION-PARAMS
>


-- 
Willy-Bas Loos


Re: find replication slots that "belong" to a publication

2025-04-04 Thread Willy-Bas Loos
postgres 13 BTW

On Fri, Apr 4, 2025 at 10:58 AM Willy-Bas Loos  wrote:

> Hi!
>
> I'm looking for a way to find out if there are still replication slots
> active for a publication before dropping the publication in an automated
> way. The idea is that the publication is thought not to be needed any
> longer, but we want to make sure.
>
> I'm having trouble finding a link between a publication, the subscriptions
> and the replication slots. Especially when you don't want to make
> assumptions about any subscriber nodes, so you are restricted to the
> publisher node.
>
> The best I could find was a query listed in pg_stat_activity that lists
> the slot name and the publication name:
> START_REPLICATION SLOT "my_slot" LOGICAL 5DD1/3E56D360 (proto_version '1',
> publication_names '"my_publication"')
>
> I don't like the idea of using string manipulation on such query strings
> to get the information I need. Postgres must have a way to compose this
> query.
> Can anyone tell me a way to find replication slots that belong to a
> publication?
>
> --
> Willy-Bas Loos
>


-- 
Willy-Bas Loos


find replication slots that "belong" to a publication

2025-04-04 Thread Willy-Bas Loos
Hi!

I'm looking for a way to find out if there are still replication slots
active for a publication before dropping the publication in an automated
way. The idea is that the publication is thought not to be needed any
longer, but we want to make sure.

I'm having trouble finding a link between a publication, the subscriptions
and the replication slots. Especially when you don't want to make
assumptions about any subscriber nodes, so you are restricted to the
publisher node.

The best I could find was a query listed in pg_stat_activity that lists the
slot name and the publication name:
START_REPLICATION SLOT "my_slot" LOGICAL 5DD1/3E56D360 (proto_version '1',
publication_names '"my_publication"')

I don't like the idea of using string manipulation on such query strings to
get the information I need. Postgres must have a way to compose this query.
Can anyone tell me a way to find replication slots that belong to a
publication?

-- 
Willy-Bas Loos