Re: Need help debugging slow logical replication

2023-02-07 Thread sunyuc...@gmail.com
Hi Justin:

- I checked that I have 2 tables using replica identity FULL, but one
table is empty and the other has only 1 row
- 7 tables use replica identity USING INDEX
- overall I have ~100 tables in the publication. But I do find that
some tables don't have a PK yet use replica identity DEFAULT. I'm a
bit confused, since the docs say DEFAULT means the PK is used; does
that mean those tables are using FULL? (A query sketch for checking
this per table is below.)
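
For anyone checking the same thing, a rough sketch of a query that lists
the replica identity per published table (the publication name below is a
placeholder; a FOR ALL TABLES publication would need pg_publication_tables
instead):

SELECT n.nspname, c.relname,
       c.relreplident,   -- d = default (PK), n = nothing, f = full, i = index
       i.indexrelid::regclass AS replident_index
FROM pg_publication p
JOIN pg_publication_rel pr ON pr.prpubid = p.oid
JOIN pg_class c ON c.oid = pr.prrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_index i ON i.indrelid = c.oid AND i.indisreplident
WHERE p.pubname = 'upgrade_pub'   -- placeholder publication name
ORDER BY c.relreplident, n.nspname, c.relname;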

Cheers.

On Tue, Feb 7, 2023 at 4:20 PM Justin  wrote:
>
>
> On Tue, Feb 7, 2023 at 6:38 PM sunyuc...@gmail.com  
> wrote:
>>
>> Hi there,
>>
>> I am using PG 14.14 on both the primary and secondary DBs on AWS, set
>> up with logical replication, and I'm having trouble with huge
>> replication lag.
>>
>> My setup is as follows:
>>
>> P1 - physical - P1-R
>>   | (logical)
>> P2 - physical - P2-R
>>
>>
>> The lag between P1 & P1-R and between P2 & P2-R is minimal (less than
>> 10 seconds), but I'm seeing several hours of lag between P1 & P2. We
>> currently have 1 pub/sub that covers about 100 tables.
>>
>> Here is the output from P1 showing there is a lag of at least 80 GB
>> (and it keeps growing):
>>
>> > SELECT now() AS CURRENT_TIME,
>>        slot_name,
>>        active,
>>        active_pid,
>>        confirmed_flush_lsn,
>>        pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
>>                                       confirmed_flush_lsn)) AS diff_size,
>>        pg_wal_lsn_diff(pg_current_wal_lsn(),
>>                        confirmed_flush_lsn) AS diff_bytes
>> FROM pg_replication_slots
>> WHERE slot_type = 'logical';
>> -[ RECORD 1 ]---+---
>> current_time| 2023-02-07 23:26:00.733+00
>> slot_name   | upgrade
>> active  | t
>> active_pid  | 5180
>> confirmed_flush_lsn | 26B09/8C08C610
>> diff_size   | 81 GB
>> diff_bytes  | 86573472240
>>
>> Here is what pg_stat_replication shows; note that the write_lag is
>> very high. We previously had to set wal_sender_timeout to 0, otherwise
>> the logical replication worker kept exiting and failing.
>>
>> > select * from pg_stat_replication;
>> -[ RECORD 1 ]+--
>> pid  | xxx
>> usesysid | xxx
>> usename  | dev
>> application_name | upgrade_target
>> client_addr  | 10.xxx
>> client_hostname  |
>> client_port  | 27404
>> backend_start| 2023-02-07 23:02:39.228572+00
>> backend_xmin | 2909692747
>> state| catchup
>> sent_lsn | 26B09/8C08C610
>> write_lsn| 26B09/840514C0
>> flush_lsn| 26B09/840514C0
>> replay_lsn   | 26B09/840514C0
>> write_lag| 00:07:03.60362
>> flush_lag| 00:07:03.60362
>> replay_lag   | 00:07:03.60362
>> sync_priority| 0
>> sync_state   | async
>> reply_time   | 2023-02-07 23:22:08.245066+00
>>
>> So the problem seems to be that P2 takes a very long time to apply a
>> chunk. Now, my question is: why? From AWS Performance Insights, it
>> looks like there is only 1 process running on P2, and it is 100%
>> CPU-bound. I don't really understand why it is CPU-bound, and any
>> help with further debugging is much appreciated.
>>
>> Cheers.
>
>
> Hi Sir,
>
> What is the replica identity being used on the tables? Are any of the tables
> using REPLICA IDENTITY FULL?
>
> How many tables are being replicated?




Re: Need help debugging slow logical replication

2023-02-07 Thread sunyuc...@gmail.com
Hi Justin, thanks for the response!

> REPLICA IDENTITY DEFAULT will only use primary keys. If the publisher
> includes tables without one, the subscriber will stop when replaying the
> WAL, throwing an error because it does not know how to replay the
> UPDATE/DELETE.

But I don't see any errors being thrown in the PostgreSQL logs.
Should I be seeing it complain there? Is PostgreSQL falling back to
replica identity FULL here?

However, I checked that table and it doesn't seem to be making progress
at all, so I suspect you are right that it is the problem.
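
For anyone following along, a rough sketch of how apply progress can be
watched (the subscription name is a placeholder; the slot name 'upgrade'
is the one shown earlier in this thread):

-- On the subscriber: position of the apply worker (PG 14)
SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription
WHERE subname = 'upgrade_sub';   -- placeholder subscription name

-- On the publisher: does confirmed_flush_lsn advance between runs?
SELECT slot_name, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'upgrade';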

> Logical Replication is most likely broken at this point.
>
> I suggest stopping logical replication and correcting the tables that don't
> have qualifying indexes for logical replication by creating the necessary
> indexes, and avoiding replica identity FULL. Then restart logical replication
> from the beginning.
>
>
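
(For the archive, the kind of fix being suggested above would look roughly
like the following for a table missing a primary key; the table, column,
and index names are placeholders:)

-- Give the table a suitable key on the publisher, e.g. a primary key:
ALTER TABLE public.some_table ADD PRIMARY KEY (id);   -- placeholder names

-- ...or a unique index on NOT NULL columns, marked as the replica identity:
CREATE UNIQUE INDEX some_table_key_idx ON public.some_table (id);
ALTER TABLE public.some_table REPLICA IDENTITY USING INDEX some_table_key_idx;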




Re: Need help debugging slow logical replication

2023-02-07 Thread sunyuc...@gmail.com
CCing the mailing list again in case someone else has an idea how to debug this.

Here is what I see in pg_locks on the subscriber at all times:

As you can see, it mostly concerns the following 3 tables: accounts,
ledger, and pending_ledger_fees. I have tried ANALYZE, VACUUM, etc. on
them; none of that helped much.
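
A query along these lines (joining pg_locks to pg_stat_activity so the
relation names are readable) is roughly what I mean by looking at pg_locks;
the logical replication apply worker shows up as its own backend:

SELECT a.pid, a.backend_type, l.locktype, l.mode, l.granted,
       l.relation::regclass AS relation
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation IS NOT NULL
ORDER BY a.pid, relation;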

I do see in the logs that these tables are getting autovacuumed *very*
frequently; is that a problem for logical replication?

2023-02-08 04:59:47 UTC::@:[13626]:LOG:  automatic vacuum of table
"dev.public.accounts": index scans: 0
pages: 0 removed, 71437 remain, 0 skipped due to pins, 11465 skipped frozen
tuples: 0 removed, 3590987 remain, 3522495 are dead but not yet
removable, oldest xmin: 2893907681
index scan bypassed: 4 pages from table (0.01% of total) have 5 dead
item identifiers
I/O timings: read: 0.000 ms, write: 0.000 ms
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 120107 hits, 0 misses, 0 dirtied
WAL usage: 1 records, 0 full page images, 708 bytes
system usage: CPU: user: 0.23 s, system: 0.00 s, elapsed: 1.01 s
2023-02-08 04:59:47 UTC::@:[13626]:LOG:  automatic vacuum of table
"dev.public.pending_fee_ledger": index scans: 0
pages: 0 removed, 199818 remain, 0 skipped due to pins, 117769 skipped frozen
tuples: 0 removed, 3670095 remain, 352 are dead but not yet
removable, oldest xmin: 2893907681
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
I/O timings: read: 0.000 ms, write: 0.000 ms
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 164212 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.27 s
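
A rough way to confirm how often those tables are autovacuumed and how many
dead tuples they carry (table names as mentioned above) is something like:

SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, autovacuum_count, autoanalyze_count
FROM pg_stat_user_tables
WHERE relname IN ('accounts', 'ledger', 'pending_fee_ledger')
ORDER BY n_dead_tup DESC;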

Thanks

On Tue, Feb 7, 2023 at 6:10 PM sunyuc...@gmail.com  wrote:
>
> Hi Justin, thanks for the response!
>
> > REPLICA IDENTITY DEFAULT will only use primary keys. If the publisher
> > includes tables without one, the subscriber will stop when replaying the
> > WAL, throwing an error because it does not know how to replay the
> > UPDATE/DELETE.
>
> But I don't see any errors being thrown in the PostgreSQL logs.
> Should I be seeing it complain there? Is PostgreSQL falling back to
> replica identity FULL here?
>
> However, I checked that table and it doesn't seem to be making progress
> at all, so I suspect you are right that it is the problem.
>
> > Logical Replication is most likely broken at this point.
> >
> > I suggest stopping logical replication and correcting the tables that don't
> > have qualifying indexes for logical replication by creating the necessary
> > indexes, and avoiding replica identity FULL. Then restart logical
> > replication from the beginning.
> >
> >


Re: Need help debugging slow logical replication

2023-02-08 Thread sunyuc...@gmail.com
OK, so I think I've got some new insight by using test_decoding to
peek at the changes being replicated (a rough sketch of the peek is
included after the list below). Here is what I think is happening:

- there is a background job generating huge transactions (10K changes
each), one after another, that basically DELETE from 1 table, INSERT
into 1 table, and UPDATE 1 table. Since this is a test environment,
these are the only transactions being accumulated.
- since there is only 1 publication, the worker on the primary
basically spends all its CPU decoding each large transaction and then
sending it to the subscriber to execute
   - memory usage on the logical decoding worker may be an issue, as it
might have spilled to disk during decoding
   - the worker on the subscriber basically spends all its time applying
these changes

- I suspect keep-alives are not sent while a transaction is being
applied, and the transaction apply time exceeds wal_sender_timeout
(30 seconds); that's why we had to change this setting before.
- I think using the new streaming feature for logical replication in
PG14 will improve performance here (a sketch is included further below).
- I think the other thing that would help is to drop the index
temporarily on the subscriber to speed up applying, but it will need to
be rebuilt once we are close to caught up.
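
The peek I mean above is roughly the following. It needs a separate
throwaway slot created with the test_decoding plugin (the existing
'upgrade' slot is in use by the subscription itself), and note that a new
slot only sees changes made after it is created:

-- Throwaway slot for inspection only ('peek_slot' is a made-up name)
SELECT pg_create_logical_replication_slot('peek_slot', 'test_decoding');

-- Look at pending changes without consuming them
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('peek_slot', NULL, 100);

-- Drop it when done so it does not hold back WAL removal
SELECT pg_drop_replication_slot('peek_slot');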

BTW, just to clarify: replica identity is not an issue for the table
that only receives INSERTs. I tried creating a replica identity for
that table, and it didn't change anything recorded in the WAL (as
expected); if anything it actually decreased performance since

What's puzzling to me is why logical apply is so much slower than
physical apply (and also, apparently, physical streaming never times
out!).
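
For reference, the streaming idea mentioned above is a subscription option
on PG 14; a sketch, with the subscription name as a placeholder:

-- On the subscriber: stream large in-progress transactions to the apply
-- worker instead of waiting for them to be fully decoded/spilled first
ALTER SUBSCRIPTION upgrade_sub SET (streaming = on);   -- placeholder name

-- On the publisher: memory available to logical decoding before a
-- transaction is spilled to disk (or streamed, when enabled)
SHOW logical_decoding_work_mem;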

On Tue, Feb 7, 2023 at 9:13 PM sunyuc...@gmail.com  wrote:
>
> CCing the mailing list again in case someone else has an idea how to debug this.
>
> Here is what I see in pg_locks on the subscriber at all times:
>
> As you can see, it mostly concerns the following 3 tables: accounts,
> ledger, and pending_ledger_fees. I have tried ANALYZE, VACUUM, etc. on
> them; none of that helped much.
>
> I do see in the logs that these tables are getting autovacuumed *very*
> frequently; is that a problem for logical replication?
>
> 2023-02-08 04:59:47 UTC::@:[13626]:LOG:  automatic vacuum of table
> "dev.public.accounts": index scans: 0
> pages: 0 removed, 71437 remain, 0 skipped due to pins, 11465 skipped frozen
> tuples: 0 removed, 3590987 remain, 3522495 are dead but not yet
> removable, oldest xmin: 2893907681
> index scan bypassed: 4 pages from table (0.01% of total) have 5 dead
> item identifiers
> I/O timings: read: 0.000 ms, write: 0.000 ms
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
> buffer usage: 120107 hits, 0 misses, 0 dirtied
> WAL usage: 1 records, 0 full page images, 708 bytes
> system usage: CPU: user: 0.23 s, system: 0.00 s, elapsed: 1.01 s
> 2023-02-08 04:59:47 UTC::@:[13626]:LOG:  automatic vacuum of table
> "dev.public.pending_fee_ledger": index scans: 0
> pages: 0 removed, 199818 remain, 0 skipped due to pins, 117769 skipped frozen
> tuples: 0 removed, 3670095 remain, 352 are dead but not yet
> removable, oldest xmin: 2893907681
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead
> item identifiers removed
> I/O timings: read: 0.000 ms, write: 0.000 ms
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
> buffer usage: 164212 hits, 0 misses, 0 dirtied
> WAL usage: 0 records, 0 full page images, 0 bytes
> system usage: CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.27 s
>
> Thanks
>
> On Tue, Feb 7, 2023 at 6:10 PM sunyuc...@gmail.com  
> wrote:
> >
> > Hi Justin, thanks for the response!
> >
> > > REPLICA IDENTITY DEFAULT will only use primary keys. If the publisher
> > > includes tables without one, the subscriber will stop when replaying the
> > > WAL, throwing an error because it does not know how to replay the
> > > UPDATE/DELETE.
> >
> > But I don't see any errors being thrown in the PostgreSQL logs.
> > Should I be seeing it complain there? Is PostgreSQL falling back to
> > replica identity FULL here?
> >
> > However, I checked that table and it doesn't seem to be making progress
> > at all, so I suspect you are right that it is the problem.
> >
> > > Logical Replication is most likely broken at this point.
> > >
> > > I suggest stopping logical replication and correcting the tables that
> > > don't have qualifying indexes for logical replication by creating the
> > > necessary indexes, and avoiding replica identity FULL. Then restart
> > > logical replication from the beginning.
> > >
> > >




Re: Advice on cluster architecture for two related, but distinct, use cases

2024-11-11 Thread sunyuc...@gmail.com
CC

On Mon, Nov 11, 2024 at 09:28 Greg Sabino Mullane 
wrote:

> Some of those requirements are vague, but yes, Patroni should probably be
> the first approach you look at. If the second datacenter is just for
> redundancy, then a simple setup would be:
>
> DCA (data center A):
> Postgres server 1
> Postgres server 2
>
> DCB:
> Postgres server 3 (set no_failover: true)
>
> You will also need a DCS system of some sort (e.g. etcd on all three
> nodes), as well as a backup system (e.g. pgBackRest). You will also need to
> decide how automated you want things to be (for example, cross-datacenter
> failover in the above would be done manually). It should definitely be able
> to handle your RPO/RTO requirements easily enough.
>
> [Patroni] However it seems relatively complex to set up and operate
>
>
> Setting things up can be a little complex, yes, but once done it just
> works, so very few operational resources are needed.
>
> We cannot assume, e.g., a load balancer.
>
>
> Possible via the application: see
> https://www.postgresql.org/docs/current/libpq-connect.html (esp.
> target_session_attrs)
>
> Cheers,
> Greg
>
>
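
For reference, a sketch of the multi-host libpq approach mentioned above
(host names and database are placeholders): the client tries the hosts in
order and only accepts a connection that allows writes, i.e. the current
primary, so no separate load balancer is required.

psql "host=pg1.dca.example,pg2.dca.example,pg3.dcb.example dbname=appdb target_session_attrs=read-write"

postgresql://app@pg1.dca.example,pg2.dca.example,pg3.dcb.example/appdb?target_session_attrs=read-write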