Re: WAL accumulating, Logical Replication pg 13

2021-05-31 Thread Willy-Bas Loos
Thank you for elaborating those possible causes and for the suggestions you
made.
1) if you have an inactive replication slot.
There is only 1 replication slot and it is active. So that is not the issue.

2) Do you have archiving enabled?
No, i never turned it on and so this is in the settings of  both publisher
and subscriber: #archive_mode = off (and show archive_mode; tells me the
same)

3) logical replication can be broken for multiple reasons, like conflicts
where the subscriber already has the data which primary wants to push. it
will not proceed until the conflicts are resolved.
That would have been in the log, but there isn't any conflict in the log.
Only the messages that i posted with my first message.

4) poor connectivity or the computer/network resources not able to keep up
with the load, can result in WAL pile up.
This would be strange since there is a 10Gbps connection within the same
rack. But it could theoretically be malfunctioning or the performance on
the subscriber could be too low.
If any of this is the case, shouldn't that be visible in
pg_stat_subscription ?

Thanks for the article, it's interesting to see how they transitioned from
londiste, even if the article is about pglogical, not logical replication
in the postgres core.
I was using Londiste to transfer the data to the new server and minimize
downtime, so the article might come in handy.
I prepared by reading the documentation, which is very straightforward.
>btw,
>how are you doing logical replication with 9.3 ? using a pglogical
extension ?
No, I'm not using logical replication in postgres 9.3 . Only on postgres 13.
About the link to the bug reports: Thanks for the suggestion. But first I'd
like to get some better grip on what is going on before searching for bugs.

Still, any help will be much appreciated

On Sat, May 29, 2021 at 5:16 PM Vijaykumar Jain <
vijaykumarjain.git...@gmail.com> wrote:

> WAL can be built up for reasons like
> 1) if you have an inactive replication slot. I mean you had a streaming
> replica which was turned off, but you did not remote the slot from primary.
> 2) Do you have archiving enabled? Are the archiving commands running fine
> ? if just the archiving is broken, then you can manually run
> archive cleanup provided, replication is all caught up fine.
>
> 3) logical replication can be broken for multiple reasons, like conflicts
> where the subscriber already has the data which primary wants to push. it
> will not proceed until the conflicts are resolved.
> 4) poor connectivity or the computer/network resources not able to keep up
> with the load, can result in WAL pile up.
>
> there are many blogs around logical replication issues, but when it was
> new in pg10, I read this.
> Recovery use cases for Logical Replication in PostgreSQL 10 | by
> Konstantin Evteev | AvitoTech | Medium
> 
>
> btw,
> how are you doing logical replication with 9.3 ? using a pglogical
> extension ?
> I can try many things, but it would be wrong to make assumptions since i
> did not work with 9.3
> for ex.
> Bug fix: Using ExecCopySlot during multi insert by bdrouvotAWS · Pull
> Request #295 · 2ndQuadrant/pglogical (github.com)
> 
> there are many issues posted here that may be relevant to your setup.
>
>
>
>
>
> On Sat, 29 May 2021 at 19:22, Willy-Bas Loos  wrote:
>
>> Yeah, indexes could slow things down, thanks. Btw I'm not using logical
>> replication for the upgrade, that's not supported for 9.3.
>> It was more complicated but that's beside the point.
>>
>> I could just delete the publication and all that belongs to it and start
>> over. But since I'm trying out logical replication, I would like to be more
>> in control than that. It's there anything that I can dig into to find out
>> why the WAL is accumulating?
>>
>> Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <
>> vijaykumarjain.git...@gmail.com>:
>>
>>> I am not too sure with 9.3
>>> i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
>>> extension)
>>>
>>> one thing to note.
>>> logical replication initiates a copy from a snapshot, then changes from
>>> then on.
>>>
>>> I had a very high insert rate on my source tables (v9.6) and the
>>> destination (v11) could not keep up (it had tons of indexes when I copied
>>> the schema) and it took around a day as the table had around 12 indexes.
>>>
>>> So at the destination(v11), I dropped all but the primary index for each
>>> table, started subscription and when it was almost caught up, rebuilt the
>>> index on the destination concurrently.
>>> it completed in 4-5 hours without stopping the source.
>>> migration completed in a few mins :)
>>>
>>> not sure if this would help, but just FYI.
>>>
>>>
>>> On Sat, 29 May 2021 at 01:36, Willy-Bas Loos  wrote:
>>>
 Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
 Debia

Re: WAL accumulating, Logical Replication pg 13

2021-05-31 Thread Vijaykumar Jain
So I got it all wrong it seems :)
You upgraded to pg13 fine? , but while on pg13 you have issues with logical
replication ?

There is a path in the postgresql source user subscription folder iirc
which covers various logical replication scenarios.
That may help you just in case.

I have tried all known ways in which replication breaks like above and once
I resolve conflicts it starts fine.
I’ll try to explore more scenarios.
Pardon my link to pglogical. I misunderstood.


On Mon, 31 May 2021 at 7:25 PM Willy-Bas Loos  wrote:

> Thank you for elaborating those possible causes and for the suggestions
> you made.
> 1) if you have an inactive replication slot.
> There is only 1 replication slot and it is active. So that is not the
> issue.
>
> 2) Do you have archiving enabled?
> No, i never turned it on and so this is in the settings of  both publisher
> and subscriber: #archive_mode = off (and show archive_mode; tells me the
> same)
>
> 3) logical replication can be broken for multiple reasons, like conflicts
> where the subscriber already has the data which primary wants to push. it
> will not proceed until the conflicts are resolved.
> That would have been in the log, but there isn't any conflict in the log.
> Only the messages that i posted with my first message.
>
> 4) poor connectivity or the computer/network resources not able to keep up
> with the load, can result in WAL pile up.
> This would be strange since there is a 10Gbps connection within the same
> rack. But it could theoretically be malfunctioning or the performance on
> the subscriber could be too low.
> If any of this is the case, shouldn't that be visible in
> pg_stat_subscription ?
>
> Thanks for the article, it's interesting to see how they transitioned from
> londiste, even if the article is about pglogical, not logical replication
> in the postgres core.
> I was using Londiste to transfer the data to the new server and minimize
> downtime, so the article might come in handy.
> I prepared by reading the documentation, which is very straightforward.
> >btw,
> >how are you doing logical replication with 9.3 ? using a pglogical
> extension ?
> No, I'm not using logical replication in postgres 9.3 . Only on postgres
> 13.
> About the link to the bug reports: Thanks for the suggestion. But first
> I'd like to get some better grip on what is going on before searching for
> bugs.
>
> Still, any help will be much appreciated
>
> On Sat, May 29, 2021 at 5:16 PM Vijaykumar Jain <
> vijaykumarjain.git...@gmail.com> wrote:
>
>> WAL can be built up for reasons like
>> 1) if you have an inactive replication slot. I mean you had a streaming
>> replica which was turned off, but you did not remote the slot from primary.
>> 2) Do you have archiving enabled? Are the archiving commands running fine
>> ? if just the archiving is broken, then you can manually run
>> archive cleanup provided, replication is all caught up fine.
>>
>> 3) logical replication can be broken for multiple reasons, like conflicts
>> where the subscriber already has the data which primary wants to push. it
>> will not proceed until the conflicts are resolved.
>> 4) poor connectivity or the computer/network resources not able to keep
>> up with the load, can result in WAL pile up.
>>
>> there are many blogs around logical replication issues, but when it was
>> new in pg10, I read this.
>> Recovery use cases for Logical Replication in PostgreSQL 10 | by
>> Konstantin Evteev | AvitoTech | Medium
>> 
>>
>> btw,
>> how are you doing logical replication with 9.3 ? using a pglogical
>> extension ?
>> I can try many things, but it would be wrong to make assumptions since i
>> did not work with 9.3
>> for ex.
>> Bug fix: Using ExecCopySlot during multi insert by bdrouvotAWS · Pull
>> Request #295 · 2ndQuadrant/pglogical (github.com)
>> 
>> there are many issues posted here that may be relevant to your setup.
>>
>>
>>
>>
>>
>> On Sat, 29 May 2021 at 19:22, Willy-Bas Loos  wrote:
>>
>>> Yeah, indexes could slow things down, thanks. Btw I'm not using logical
>>> replication for the upgrade, that's not supported for 9.3.
>>> It was more complicated but that's beside the point.
>>>
>>> I could just delete the publication and all that belongs to it and start
>>> over. But since I'm trying out logical replication, I would like to be more
>>> in control than that. It's there anything that I can dig into to find out
>>> why the WAL is accumulating?
>>>
>>> Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <
>>> vijaykumarjain.git...@gmail.com>:
>>>
 I am not too sure with 9.3
 i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
 extension)

 one thing to note.
 logical replication initiates a copy from a snapshot, then changes from
 then on.

 I had a very high insert rate on my source tables (v9.6) 

Re: WAL accumulating, Logical Replication pg 13

2021-05-31 Thread Willy-Bas Loos
On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain <
vijaykumarjain.git...@gmail.com> wrote:

> So I got it all wrong it seems :)
>
Thank you for taking the time to help me!

You upgraded to pg13 fine? , but while on pg13 you have issues with logical
> replication ?
>
Yes, the upgrade went fine. So here are some details:
I already had londiste running on postgres 9.3, but londiste wouldn't run
on Debian 10
So i first made the new server Debian 9 with postgres 9.6 and i started
replicating with londiste from 9.3 to 9.6
When all was ready, i stopped the replication to the 9.6 server and deleted
all londiste & pgq content with drop schema cascade.
Then I upgraded the server to Debian  10. Then i user pg_upgrade to upgrade
from postgres 9.6 to 13. (PostGIS versions were kept compatible).
Then I added logical replication and a third server as a subscriber.

I was going to write that replication is working fine (since the table
contains a lot of data and there are no conflicts in the log), but it turns
out that it isn't.
The subscriber is behind and It looks like there hasn't been any incoming
data after the initial data synchronization.
So at least now i know that the WAL is being retained with a reason. The
connection is working properly (via psql anyway)

I will also look into how to diagnose this from the system tables, e.g.
substracting LSN's to get some quantitative measure  for the lag.



> There is a path in the postgresql source user subscription folder iirc
> which covers various logical replication scenarios.
> That may help you just in case.
>
OK, so comments in the source code you mean?


Re: WAL accumulating, Logical Replication pg 13

2021-05-31 Thread Vijaykumar Jain
This part.
If you can read perl :),

https://github.com/postgres/postgres/tree/master/src/test/subscription/t

On Mon, May 31, 2021, 9:02 PM Willy-Bas Loos  wrote:

>
>
> On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain <
> vijaykumarjain.git...@gmail.com> wrote:
>
>> So I got it all wrong it seems :)
>>
> Thank you for taking the time to help me!
>
> You upgraded to pg13 fine? , but while on pg13 you have issues with
>> logical replication ?
>>
> Yes, the upgrade went fine. So here are some details:
> I already had londiste running on postgres 9.3, but londiste wouldn't run
> on Debian 10
> So i first made the new server Debian 9 with postgres 9.6 and i started
> replicating with londiste from 9.3 to 9.6
> When all was ready, i stopped the replication to the 9.6 server and
> deleted all londiste & pgq content with drop schema cascade.
> Then I upgraded the server to Debian  10. Then i user pg_upgrade to
> upgrade from postgres 9.6 to 13. (PostGIS versions were kept compatible).
> Then I added logical replication and a third server as a subscriber.
>
> I was going to write that replication is working fine (since the table
> contains a lot of data and there are no conflicts in the log), but it turns
> out that it isn't.
> The subscriber is behind and It looks like there hasn't been any incoming
> data after the initial data synchronization.
> So at least now i know that the WAL is being retained with a reason. The
> connection is working properly (via psql anyway)
>
> I will also look into how to diagnose this from the system tables, e.g.
> substracting LSN's to get some quantitative measure  for the lag.
>
>
>
>> There is a path in the postgresql source user subscription folder iirc
>> which covers various logical replication scenarios.
>> That may help you just in case.
>>
> OK, so comments in the source code you mean?
>
>


Re: WAL accumulating, Logical Replication pg 13

2021-05-31 Thread Tomas Pospisek

Hi Willy-Bas Loos,

On 31.05.21 17:32, Willy-Bas Loos wrote:



On Mon, May 31, 2021 at 4:24 PM Vijaykumar Jain 
> wrote:


So I got it all wrong it seems :)

Thank you for taking the time to help me!

You upgraded to pg13 fine? , but while on pg13 you have issues with
logical replication ? 


Yes, the upgrade went fine. So here are some details:
I already had londiste running on postgres 9.3, but londiste wouldn't 
run on Debian 10
So i first made the new server Debian 9 with postgres 9.6 and i started 
replicating with londiste from 9.3 to 9.6
When all was ready, i stopped the replication to the 9.6 server and 
deleted all londiste & pgq content with drop schema cascade.
Then I upgraded the server to Debian  10. Then i user pg_upgrade to 
upgrade from postgres 9.6 to 13. (PostGIS versions were kept compatible).

Then I added logical replication and a third server as a subscriber.

I was going to write that replication is working fine (since the table 
contains a lot of data and there are no conflicts in the log), but it 
turns out that it isn't.
The subscriber is behind and It looks like there hasn't been any 
incoming data after the initial data synchronization.
So at least now i know that the WAL is being retained with a reason. The 
connection is working properly (via psql anyway)


I once maybe had a similar problem due to some ports that were needed 
for replication being firewalled off or respectively the master having 
the wrong IP address of the old master (now standby server) or such.


There was absolutely no word anywhere in any log about the problem I was 
just seeing the new postgres master not starting up after hours and 
hours of waiting after a failover. I somehow found out about the 
required port being blocked (I don't remember - maybe seing the 
unanswered SYNs in tcpdump? Or via ufw log entries?).


I will also look into how to diagnose this from the system tables, e.g. 
substracting LSN's to get some quantitative measure  for the lag.




There is a path in the postgresql source user subscription folder
iirc which covers various logical replication scenarios.
That may help you just in case.

OK, so comments in the source code you mean?