Performance penalty during logical postgres replication

2020-12-09 Thread Lars Vonk
Hi,

We are doing a logical Postgres replication from Postgres 11 to 12. Our
database is around 700 GB (8 CPUs, 32 GB RAM).
During the replication process, at some point, we see a huge performance
penalty on a particular table. This table acts as a queue, with lots of
inserts and deletes happening throughout the day. Most of the time this
table is empty, but during this performance penalty the number of rows
grows to 10,000 and processing is not fast enough to empty the table. The
main reason for this (as far as we can see) is that the query for selecting
the next row to process slows down from under 10 ms to around 400 ms. This
eventually causes too much CPU load on the primary and we have to cancel
the replication process.
We already tried the initial load three times, and it consistently fails
with the same "error". The last try was a per-table approach, excluding
this "queue" table.
After cancelling the replication the query is fast again and the load on
the primary goes back to normal. We see that this happens when replicating
large tables (millions of rows). During this performance penalty, the
EXPLAIN of the query selecting the next row from this table shows it is
doing a sequential scan (there is an index, but it is not used).

- What could cause this performance penalty?
- Is this something other people have experienced as well during the
initial load of a logical replication with large tables?
- We are now thinking of temporarily increasing the number of CPUs and the
amount of RAM for the migration. Would this help in this case?

Thanks in advance,
Lars


Re: Performance penalty during logical postgres replication

2020-12-10 Thread Lars Vonk
Hi,

> - on the receiving side, avoid creating indexes on the tables: create just
> the necessary PK or UK, wait for the initial load to complete and then add
> all the rest
>

Thanks, this is a good tip. We are going to do this.

We also noticed that the code getting the next item from the "queue" was
doing the query with SELECT ... FOR UPDATE, but without SKIP LOCKED. This
probably also caused extra wait time while the server was busier than
usual during the replication. So we are going to add this and try again.
We are also minimizing load on the queue during the initial replication.
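
For reference, a minimal sketch of the pattern we mean (hypothetical table
and column names, not our real schema):

  -- Claim the next unprocessed item without waiting on rows that other
  -- workers already hold locked.
  BEGIN;
  SELECT id, payload
  FROM job_queue
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED;
  -- ... process the row here, then delete it ...
  COMMIT;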

On to the next try.

Lars

On Wed, Dec 9, 2020 at 6:45 PM Victor Yegorov  wrote:

> ср, 9 дек. 2020 г. в 10:21, Lars Vonk :
>
>> We are doing a logical postgres replication from Postgres 11 to 12. Our
>> database is around 700GB (8 cpu's, 32 GB).
>> During the replication process, at some point, we see a huge performance
>> penalty on a particular table. This table acts as a queue with lots of
>> inserts and deletes happening throughout the day. For most of the time this
>> table is empty, but during this performance penalty the number of rows in
>> this table grows to 10.000 rows, and processing is not fast enough to empty
>> this table. Main reason for this (as far as we see) is that the performance
>> of the query for selecting the next row to process drops from < 10MS to
>> 400MS. This eventually causes too much cpu load on the Primary and we have
>> to cancel the replication process.
>>
> We already tried the initial load three times, and it consistently fails
>> with the same "error". Last try was a per table approach and excluding this
>> "queue" table.
>> After cancelling the replication the query is fast again and the load on
>> the Primary goes back to normal. We see that this happens when replicating
>> large tables (> millions of rows). During this performance penalty the
>> explain of the query selecting the next row from this table tells us it is
>> doing a sequential scan (there is an index but it is not used).
>>
>> - What could cause this performance penalty?
>> - Is this something other people experienced as well during the initial
>> load of a logical replication with large tables?
>> - We are now thinking of temporarily increasing the number of CPU's and
>> RAM for the migration. Would this help in this case?
>>
>
> I've seen similar symptoms in cases with (a) home-made queues in the
> tables and (b) long transactions.
> Unfortunately, a queue requires frequent vacuuming to preserve a more or
> less constant size of the queue and its indexes.
> And long transactions prevent vacuum from cleaning up the queue.
> The initial synchronization phase of logical replication is in fact such a
> transaction.
>
> I would recommend doing the following:
> - avoid adding ALL tables to the publication
> - instead, split the tables into batches in such a way that the initial
> batch processing takes limited time (say, 15-30 minutes at most)
> - of course, this leaves the biggest tables alone — add those one by one
> to the publication, preferably in a time slot with minimal load on the
> queue
> - make sure to catch up on the queue processing and vacuum it between
> batches
> - on the receiving side, avoid creating indexes on the tables: create just
> the necessary PK or UK, wait for the initial load to complete and then add
> all the rest
>
> As for the queue, PGQ from Skytools uses a different approach to
> maintain queue tables:
> - once in a while (every 2 hours by default) processing is switched to a
> new table; tab_1, tab_2, tab_3 are used in a round-robin fashion
> - after the switch, any remaining entries can be moved from the previous
> to the live table (this shouldn't be necessary if the switch is done
> properly, although it might be tricky in the presence of long transactions)
> - the previous table is TRUNCATEd
>
> In your case, you can do `VACUUM FULL` between replicating each batch of
> tables.
>
> --
> Victor Yegorov
>


Re: Performance penalty during logical postgres replication

2020-12-10 Thread Lars Vonk
It has been 4 hours and it is safe to say that the measures we took have a
huge positive effect: more than 30 times faster, with no noticeable effect
on the running primary at all.
A 20 GB table is now replicated in under 10 minutes.

- We removed all indexes except the primary key and unique indexes from the
large tables.
- We then changed the query on the queue table to add the SKIP LOCKED
clause.
- We take a per-table approach for the larger tables.

I think removing the indexes had the most significant impact, but I am not
sure how to prove this since we made multiple changes at the same time. A
sketch of the index step is below.
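
For anyone trying the same, a sketch of the "drop secondary indexes first,
add them back after the initial load" step (hypothetical index and column
names; CREATE INDEX CONCURRENTLY avoids blocking writes but cannot run
inside a transaction block):

  -- On the subscriber, before adding the table to the subscription:
  DROP INDEX IF EXISTS large_table_some_column_idx;

  -- After the initial copy has finished and the table is replicating:
  CREATE INDEX CONCURRENTLY large_table_some_column_idx
    ON large_table (some_column);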

Thanks again for the tips!

-- Lars


On Thu, Dec 10, 2020 at 9:12 AM Lars Vonk  wrote:

> Hi,
>
> - on the receiving side, avoid creating indexes on the tables: create just
>> a necessary PK or UK, wait for the initial load to complete and then add
>> all the rest ones
>>
>
> Thanks, this is a good tip. We are going to add this
>
> We also noticed the code that was getting the next from the "queue" was
> doing the query with 'select for update', but without a 'SKIP LOCKED'. This
> is probably also something that  caused more wait time if the server is
> more busy as usual during the replication. So we are going to add this and
> try again. We are also minimizing load on the queue during initial
> replication.
>
> On to the next try.
>
> Lars
>
> On Wed, Dec 9, 2020 at 6:45 PM Victor Yegorov  wrote:
>
>> ср, 9 дек. 2020 г. в 10:21, Lars Vonk :
>>
>>> We are doing a logical postgres replication from Postgres 11 to 12. Our
>>> database is around 700GB (8 cpu's, 32 GB).
>>> During the replication process, at some point, we see a huge performance
>>> penalty on a particular table. This table acts as a queue with lots of
>>> inserts and deletes happening throughout the day. For most of the time this
>>> table is empty, but during this performance penalty the number of rows in
>>> this table grows to 10.000 rows, and processing is not fast enough to empty
>>> this table. Main reason for this (as far as we see) is that the performance
>>> of the query for selecting the next row to process drops from < 10MS to
>>> 400MS. This eventually causes too much cpu load on the Primary and we have
>>> to cancel the replication process.
>>>
>> We already tried the initial load three times, and it consistently fails
>>> with the same "error". Last try was a per table approach and excluding this
>>> "queue" table.
>>> After cancelling the replication the query is fast again and the load on
>>> the Primary goes back to normal. We see that this happens when replicating
>>> large tables (> millions of rows). During this performance penalty the
>>> explain of the query selecting the next row from this table tells us it is
>>> doing a sequential scan (there is an index but it is not used).
>>>
>>> - What could cause this performance penalty?
>>> - Is this something other people experienced as well during the initial
>>> load of a logical replication with large tables?
>>> - We are now thinking of temporarily increasing the number of CPU's and
>>> RAM for the migration. Would this help in this case?
>>>
>>
>> I've seen similar symptoms in cases with (a) home-made queues in the
>> tables and (b) long transactions.
>> Unfortunately, queue requires frequent vacuuming to preserve more or less
>> constant size of the queue and it's indexes.
>> And long transactions prevent the vacuum from cleaning up the queue.
>> Initial synchronization phase of the logical replication is in fact such a
>> transaction.
>>
>> I would recommend doing the following:
>> - avoid adding ALL tables to the publication
>> - instead, split all tables in a batches in such a way, that initial
>> batch processing takes limited time (say, 15-30 minutes at most)
>> - of course, this leaves the biggest tables alone — add those one by one
>> to the publication, preferably at the time slot with minimal load on the
>> queue.
>> - make sure to catch up on the queue processing and vacuum it between
>> batches
>> - on the receiving side, avoid creating indexes on the tables: create
>> just a necessary PK or UK, wait for the initial load to complete and then
>> add all the rest ones
>>
>> As for the queue, PGQ from skytools is using different approach to
>> maintain queue tables:
>> - once in a while (2 hours by default) processing is switched to a new
>> table, tab_1, tab_2, tab_3 are used in a round
>> - after the switch, any remaining entries can be moved from previous to
>> the live table (shouldn't be necessary if switch is done properly, although
>> might be tricky in a presence of a long transactions)
>> - previous table is TRUNCATEd
>>
>> In your case, you can do `VACUUM FULL` between replicating each batch of
>> tables.
>>
>> --
>> Victor Yegorov
>>
>


Missing rows after logical replication in new primary

2020-12-18 Thread Lars Vonk
Hi,

We migrated from Postgres 11 to 12 using logical replication. Today we
noticed that one table is missing 1252 rows after the replication finished
and we flipped to the new primary (we still have the old one so we can
recover).

We see that these rows were inserted into the table after the initial copy
of the table started. Most of the missing rows (1230) seem to be from
inserts happening **during the initial copy** and the rest (22) from
inserts **during the period the replication ran** (7 days).

This is (for us) a high-volume table (> 400,000,000 rows), with more than
150,000 new inserts daily.

We took a per-table approach for the replication, and this table was the
last table we started replicating.

We did some sanity checks before we switched to the new primary, like
comparing max(id) to see if the replica was up to date (including for this
table) and counts on some tables, and that all checked out okay.

So how can this happen? For now it seems that only this table suffered from
it, but we are pretty "scared" that more tables are affected, so we will
have to check them all.

Lars


Missing rows after migrating from postgres 11 to 12 with logical replication

2020-12-18 Thread Lars Vonk
Hi,

We migrated from Postgres 11 to 12 using logical replication (over the
local network). Today we noticed that one table is missing 1252 rows after
the replication finished and we flipped to the new primary (we still have
the old primary database so we can recover).

We see that these rows were inserted into the table after the initial copy
of the table started. Most of the missing rows (1230) seem to be from
inserts happening **during the initial copy** and the rest (22) from
inserts **during the period the replication ran** (7 days).

After further investigation, unfortunately more tables have missing rows;
all of the missing rows date from after the initial copy phase of their
table. We took a per-table approach for the replication, starting with
creating an empty publication and adding tables via

ALTER PUBLICATION pg12_migration ADD TABLE FOO

After that we refreshed the subscription on the new Postgres 12 primary
(the subscriber) using

ALTER SUBSCRIPTION pg12_migration REFRESH PUBLICATION;

We only added new tables after the initial copy of the previous one was
done (the internal state was "replicating"); a sketch of one way to check
this is below.
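
The per-table sync state lives in pg_subscription_rel on the subscriber;
this is one way to check it (srsubstate: 'i' = initialize, 'd' = data copy,
's' = synchronized, 'r' = ready, i.e. normal replication):

  SELECT srrelid::regclass AS table_name, srsubstate
  FROM pg_subscription_rel
  ORDER BY 1;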

We never stopped the subscription during all this, and we started with a
fresh schema.

We did some sanity checks before we switched to the new primary, like
comparing max(id) to see if the replica was up to date (including for this
table) and counts on some smaller tables, and that all checked out okay; we
never thought of missing rows somewhere in between.
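
In hindsight an exact per-table count comparison between the old and the
new primary would have caught this; a sketch of how such a check could be
generated (pg_stat_user_tables only holds estimates, so the exact counts
still need a count(*) per table):

  -- Generate one exact-count statement per user table; run the output on
  -- both the old and the new primary and diff the results.
  SELECT format('SELECT %L AS table_name, count(*) FROM %I.%I;',
                schemaname || '.' || relname, schemaname, relname)
  FROM pg_stat_user_tables
  ORDER BY 1;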

So how can this happen?

Lars


Posts not appearing in this mailinglist

2020-12-19 Thread Lars Vonk
Hi,

Yesterday I tried (twice) to report an issue / ask a question about
Postgres logical replication on this list, but both emails didn't appear in
the mailing list.

What could be the cause of that?

Thanks in advance,
Lars


Re: Posts not appearing in this mailinglist

2020-12-19 Thread Lars Vonk
Hi,

Thanks for replying, I see them now as well. Super weird, they still don't
show up in my Gmail, and I didn't notice them yesterday via the web either
:-/.

Sorry for posting it twice then...

Lars

On Sat, Dec 19, 2020 at 11:51 AM Herbert J. Skuhra 
wrote:

> On Sat, Dec 19, 2020 at 11:38:09AM +0100, Lars Vonk wrote:
> > Hi,
> >
> > Yesterday I tried (twice) to report an issue / ask a question on Postgres
> > logical replication on this user group, but both emails didn't appear in
> > this mailing list.
>
> https://www.postgresql.org/list/pgsql-general/since/20201218/
>
> I received both messages from you.
>
> --
> Herbert
>


Re: Missing rows after migrating from postgres 11 to 12 with logical replication

2020-12-20 Thread Lars Vonk
Hi,

Just wondering if someone knows how this could have happened? Did we miss
something when setting up the logical replication? Are there any scenarios
in which this could happen (like a database restart or anything else)?
Or should I report this as a bug (although I can't imagine it is)?
We really would like to know how we can prevent this from happening next
time.

We still have the old primary, and a snapshot of the current primary from
around the time we flipped from the old to the new. So we could do some
digging into the cause, but we don't know what to look for...

Any help or tips are appreciated.

Thanks in advance,

Lars


On Fri, Dec 18, 2020 at 4:42 PM Lars Vonk  wrote:

> Hi,
>
> We migrated from postgres 11 to 12 using logical replication (over local
> network). Today we noticed that one table is missing 1252 rows after the
> replication finished and we flipped to the new primary (we still have the
> old master database so we can recover).
>
> We see that these rows were inserted in the table after starting the
> initial copy of the table. Most of the missing rows seem from new inserts
> happening **during the initial copy** (1230) and the rest (22) from inserts
> **during the period the replication ran** (7 days).
>
> After further investigation unfortunately more tables have missing rows,
> all of them are after the initial table copy phase. We took a per-table
> approach for the replication, starting with creating an empty publication
> and adding tables via
>
> ALTER PUBLICATION pg12_migration ADD TABLE FOO
>
> After that we refreshed the publication on the "new postgres 12 primary"
> using
>
> ALTER SUBSCRIPTION pg12_migration REFRESH PUBLICATION;
>
> We only added new tables after the the initial copy of the previous was
> done (the internal state was replicating).
>
> We never stopped the subscriptions during all this and we started with a
> fresh schema.
>
> We did some sanity checks before we switched to the new master, like
> comparing max(id) to see if the replica was up to date (including this
> table) and counts on some smaller tables and that all checked out okay, we
> never thought of missing rows somewhere in between
>
> So how can this happen?
>
> Lars
>


Re: Missing rows after migrating from postgres 11 to 12 with logical replication

2020-12-21 Thread Lars Vonk
Hi Adrian,

Thanks for taking the time to reply!

First to answer your questions:

> 1) Was there activity on the 12 instance while it was being replicated
> to that could account for the missing (deleted?) rows?
>

No, there was no activity other than us running some queries to check how
far along the replication was.

> 2) Are the logs still available for inspection to see if there were any
> errors thrown?
>

Yes, and we dug into those. We also found some indications that something
went wrong.

> 3) Are there FK relationships involved?
>

No

> 4) How did you determine the rows were missing?
>

We were alerted by a bug later that day and found that some rows were
missing on the new primary. We did a comparison based on primary key and
found that several tables were missing rows. Before the switch we
unfortunately only checked max(id) and did counts on some tables, and those
all checked out. We didn't do a count on all tables...

So, to come back to the logs:

We dug a little deeper and we did find ERROR logs around the time we ran
the initial copies. During a period of several hours that day we see a
couple of messages like:

ERROR: requested WAL segment 000101F1001D has already been removed

This message is logged a few times and then no more (perhaps it recovered
from it?).

Other than this error there are no other errors, but unfortunately we never
checked the logs before migrating to the new primary...
In hindsight not very smart of course, but we never thought of it because:

a) the initial copy and the catching up all seemed fine;
b) in previous attempts, when we made some errors, we noticed for instance
that the WAL files on the previous primary were kept because the new
primary had not yet processed them. So we assumed that when all WAL files
are "gone" and the max(id) checks out, the replica is in sync and
consistent with the primary;
c) our experience with hot standby replication is that whenever a WAL
segment is missing it does not skip over it but waits until you restore it.
We assumed (and still assume) that this is also the case with logical
replication.

So the questions we now have are:

1) Is it correct that the replica of a logical replication skips over
missing WAL files?
2) If so, how can you know that it skipped a WAL segment without looking at
the log files or doing a full count?
3) Is there a fail-fast mechanism for logical replication (like there is
for hot standby) so that when a WAL file is missing it stops replicating
any further?
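
Related to question 2, a sketch of checks that could make this visible
during a retry (these catalog views exist in PostgreSQL 10 and later):

  -- On the subscriber: is the apply worker alive, and how far has it
  -- received and applied?
  SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
  FROM pg_stat_subscription;

  -- On the publisher: is the slot active, and how much WAL is it holding
  -- back for the consumer?
  SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
  FROM pg_replication_slots;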

Regards,
Lars

On Sun, Dec 20, 2020 at 6:58 PM Adrian Klaver 
wrote:

> On 12/20/20 8:33 AM, Lars Vonk wrote:
> > Hi,
> >
> > Just wondering if someone knows how this could have happened? Did we
> > miss out on something when setting up the logical replication? Are there
> > any scenario's in which this could happen (like database restart or
> > anything else?).
> > Or should I report this a bug (although I can't image it is)?
> > We really would like to know how we can prevent this from happening the
> > next time.
> >
> > We still have the old primary, and a snapshot of the current primary
> > around the time we flipped from the old to the new. So we could some
> > digging into the cause, but we don't know what to look for...
> >
>
> Questions I have:
>
> 1) Was there activity on the 12 instance while it was being replicated
> to that could account for the missing(deleted?) rows?
>
> 2) Are the logs still available for inspection to see if there where any
> errors thrown?
>
> 3) Are there FK relationships involved?
>
> 4) How did you determine the rows where missing?
>
>
> > Any help or tips are appreciated.
> >
> > Thanks in advance,
> >
> > Lars
> >
> >
> > On Fri, Dec 18, 2020 at 4:42 PM Lars Vonk  > <mailto:lars.v...@gmail.com>> wrote:
> >
> > Hi,
> >
> > We migrated from postgres 11 to 12 using logical replication (over
> > local network). Today we noticed that one table is missing 1252 rows
> > after the replication finished and we flipped to the new primary (we
> > still have the old master database so we can recover).
> >
> > We see that these rows were inserted in the table after starting the
> > initial copy of the table. Most of the missing rows seem from new
> > inserts happening **during the initial copy** (1230) and the rest
> > (22) from inserts **during the period the replication ran** (7 days).
> >
> > After further investigation unfortunately more tables have missing
> > rows, all of them are after the initial table copy phase. We took a
> > per-table approach for the replication, starting with creating an
> > empty publication and adding tables via
> >
> > ALT

Re: Missing rows after migrating from postgres 11 to 12 with logical replication

2020-12-21 Thread Lars Vonk
>
> What was being run when the above ERROR was triggered?


The initial copy of a table. Other than that, we ran select
pg_size_pretty(pg_relation_size('table_name')) to see the current size of
the table being copied, to get a feeling for the progress.

And whenever we added a new table to the publication we ran ALTER
SUBSCRIPTION migration REFRESH PUBLICATION; to add the new table to the
subscription. But not around that timestamp: the last one was about 50
minutes before the first occurrence of that ERROR (and there were no ERRORs
after the prior ALTER SUBSCRIPTIONs).

But after the initial copies ended there are more ERRORs about different
missing WAL segments. Each missing WAL segment is logged as an ERROR a
couple of times and then no more. After a couple of hours no more errors
are logged.

Lars

On Mon, Dec 21, 2020 at 10:23 PM Adrian Klaver 
wrote:

> On 12/21/20 12:26 PM, Lars Vonk wrote:
> > Hi Adrian,
> >
> > Thanks for taking the time to reply!
>
> >
> > 2) Are the logs still available for inspection to see if there where
> > any
> > errors thrown?
> >
> >
> > Yes, and we dug into those. And we also found some indications that
> > something went wrong.
> >
>
> > 4) How did you determine the rows where missing?
> >
> >
> > We were alerted by a bug later that day and found that some rows were
> > missing in the new primary. We did a compare based on primary key and
> > found that several tables were missing rows. Before the switch we
> > unfortunately only checked max(id) and did some counts on tables and
> > those all checked out. We didn't do a count on all tables...
> >
> > So to come back at the logs:
> >
> > We dug a little deeper and we did found ERROR logs around the time we
> > ran the initial copies. During a period of several hours that day we see
> > a couple of messages like:
> >
> > ERROR: requested WAL segment 000101F1001D has already
> > been removed
>
> What was being run when the above ERROR was triggered?
>
>
> >
> > Regards,
> > Lars
> >
> > On Sun, Dec 20, 2020 at 6:58 PM Adrian Klaver 
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: Missing rows after migrating from postgres 11 to 12 with logical replication

2020-12-22 Thread Lars Vonk
>
> Did you have some other replication running on the 11 instance?
>

Yes, the 11 instance also had another (11) replica running. (But these
logs are from the 12 instance.)

The new 12 instance also had a replica running.

> In any case what was the command logged just before the ERROR.
>

There is nothing logged.

These are the only log statements just before the error message; one
second later the ERROR is logged:

2020-12-10 13:26:43 UTC::@:[5537]:LOG:  checkpoints are occurring too
frequently (20 seconds apart)
2020-12-10 13:26:43 UTC::@:[5537]:HINT:  Consider increasing the
configuration parameter "max_wal_size".
2020-12-10 13:26:43 UTC::@:[5537]:LOG:  checkpoint starting: wal
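
If WAL on the 12 instance really was recycled out from under a consumer
(and the checkpoint warnings do point at aggressive recycling), one
possible mitigation sketch, assuming PostgreSQL 11/12 parameter names (on
RDS these are set via the parameter group rather than ALTER SYSTEM):

  -- Keep more finished WAL segments around for physical standbys that have
  -- no replication slot, and let checkpoints happen less often.
  ALTER SYSTEM SET wal_keep_segments = 256;  -- renamed to wal_keep_size in 13
  ALTER SYSTEM SET max_wal_size = '4GB';
  SELECT pg_reload_conf();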

Lars

On Mon, Dec 21, 2020 at 11:51 PM Adrian Klaver 
wrote:

> On 12/21/20 2:42 PM, Lars Vonk wrote:
> > What was being run when the above ERROR was triggered?
> >
> >
> > The initial copy of a table. Other than that we ran select
> > pg_size_pretty(pg_relation_size('table_name')) to see the current size
> > of the table being copied to get a feeling on progress.
> >
> > And whenever we added a new table to the publication we ran ALTER
> > SUBSCRIPTION migration REFRESH PUBLICATION; to add any new table to the
> > subscription. But not around that timestamp, about 50 minutes before the
> > first occurence of that ERROR. (no ERRORS after prior ALTER
> SUBSCRIPTIONs).
> >
> > But after the initial copy's ended there are more ERROR's on different
> > WAL segments missing. Each missing wal segment is logged as ERROR a
> > couple of times and then no more. After a couple of hours no errors are
> > logged.
>
> Something was looking for the WAL segment.
>
> Did you have some other replication running on the 11 instance?
>
> In any case what was the command logged just before the ERROR.
>
> >
> > Lars
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: Missing rows after migrating from postgres 11 to 12 with logical replication

2020-12-23 Thread Lars Vonk
The full setup is:

**Before:
11 primary -> 11 hotstandby via binary

**During migration:
11 primary -> 11 hotstandby via binary
  |-> 12 new instance via logical
        |-> 12 new replica via binary

**After migration:
12 primary
  |-> 12 replica via binary


On Tue, Dec 22, 2020 at 7:16 PM Adrian Klaver 
wrote:

> On 12/22/20 9:10 AM, Lars Vonk wrote:
> > Did you have some other replication running on the 11 instance?
> >
> >
> > Yes the 11 instance also had another (11) replica running. (But these
> > logs are from the 12 instance)
>
> The 11 instance had the data that went missing in the 12 instance, so
> what shows up in logs for the 11 instance during this period that is
> relevant?
>
> >
> > The new 12 instance also had a replica running.
>
> So the setup was?:
>
> 1) 11 primary --> 11 standby via what replication logical or binary?
>  | --> 12 new instance via logical
>
> 2) 12(new) primary --> 12(?) standby via what replication logical or
> binary?
>
> >
> > In any case what was the command logged just before the ERROR.
> >
> >
> > There is nothing logged.
> >
> > These are the only log statements just before the error message, one
> > second later the ERROR is logged:
> >
> > 2020-12-10 13:26:43 UTC::@:[5537]:LOG:  checkpoints are occurring too
> > frequently (20 seconds apart)
> > 2020-12-10 13:26:43 UTC::@:[5537]:HINT:  Consider increasing the
> > configuration parameter "max_wal_size".
> > 2020-12-10 13:26:43 UTC::@:[5537]:LOG:  checkpoint starting: wal
> >
> > Lars
> >
> > On Mon, Dec 21, 2020 at 11:51 PM Adrian Klaver
> > mailto:adrian.kla...@aklaver.com>> wrote:
> >
> > On 12/21/20 2:42 PM, Lars Vonk wrote:
> >  > What was being run when the above ERROR was triggered?
> >  >
> >  >
> >  > The initial copy of a table. Other than that we ran select
> >  > pg_size_pretty(pg_relation_size('table_name')) to see the current
> > size
> >  > of the table being copied to get a feeling on progress.
> >  >
> >  > And whenever we added a new table to the publication we ran ALTER
> >  > SUBSCRIPTION migration REFRESH PUBLICATION; to add any new table
> > to the
> >  > subscription. But not around that timestamp, about 50 minutes
> > before the
> >  > first occurence of that ERROR. (no ERRORS after prior ALTER
> > SUBSCRIPTIONs).
> >  >
> >  > But after the initial copy's ended there are more ERROR's on
> > different
> >  > WAL segments missing. Each missing wal segment is logged as ERROR
> a
> >  > couple of times and then no more. After a couple of hours no
> > errors are
> >  > logged.
> >
> > Something was looking for the WAL segment.
> >
> > Did you have some other replication running on the 11 instance?
> >
> > In any case what was the command logged just before the ERROR.
> >
> >  >
> >  > Lars
> >  >
> >
> >
> > --
> > Adrian Klaver
> > adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com>
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: Missing rows after migrating from postgres 11 to 12 with logical replication

2020-12-24 Thread Lars Vonk
Well, thanks for taking the time anyway. Indeed, reducing the number of
moving parts next time is a good idea.

I would still expect, though, that if a logical replica misses a WAL
segment it would stop replicating (and/or report an inconsistent state). I
know this is the case with binary replication (it stops replicating).
As a last question, do you know whether this is also the case with logical
replication, or is what happened here an "expected outcome" when a logical
replica misses a WAL segment?

Lars

On Thu, Dec 24, 2020 at 5:52 PM Adrian Klaver 
wrote:

> On 12/23/20 1:40 AM, Lars Vonk wrote:
> > The full setup is:
> >
> > **Before:
> > 11 primary -> 11 hotstandby binary
> >
> > **During migration
> > 11 primary -> 11 hotstandby binary
> >| -> 12 new instance via logical
> >|-> 12 new replica via binary
> >
> > **After migration
> > 12 primary
> > |-> 12 replica via binary
> >
> >
>
> There are several moving parts here. I have to believe the problem is
> related. Just not sure how to figure it out after the fact. The best I
> can come up with is retry the process and monitor closely in real or
> near real time to see if you can catch the issue. Another option is to
> reduce the parts count by not running the binary 12 --> 12 replication
> at the same time you are doing the 11 --> 12 logical replication.
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Performance issue on GIN index with gin_trgm_ops index column

2022-07-27 Thread Lars Vonk
Hi all,

We are investigating a performance issue when searching on a column
indexed with GIN and gin_trgm_ops. This specific (organization_id,
aggregate_type) combination has 19K records in total and the
organization_id alone has 30K records. The search_records table has 38M
records in total.

The table and index are defined as follows:

       Column      |       Type        | Collation | Nullable |                  Default
  -----------------+-------------------+-----------+----------+--------------------------------------------
   id              | bigint            |           | not null | nextval('search_records_id_seq'::regclass)
   organization_id | uuid              |           | not null |
   aggregate_id    | uuid              |           | not null |
   aggregate_type  | character varying |           | not null |
   document        | text              |           |          |
  Indexes:
      "search_records_pkey" PRIMARY KEY, btree (id)
      "search_records_keys" UNIQUE, btree (organization_id, aggregate_id, aggregate_type)
      "search_records_btree_gin" gin ((organization_id::character varying), aggregate_type, document gin_trgm_ops)
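
For reference, a DDL sketch that would produce an index like
search_records_btree_gin (this assumes the pg_trgm and btree_gin
extensions, which a mixed GIN over scalar columns plus gin_trgm_ops needs):

  CREATE EXTENSION IF NOT EXISTS pg_trgm;
  CREATE EXTENSION IF NOT EXISTS btree_gin;

  CREATE INDEX search_records_btree_gin
    ON search_records
    USING gin ((organization_id::character varying), aggregate_type,
               document gin_trgm_ops);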

The query we execute is:

  select aggregate_id from search_records where organization_id::varchar =
'975097c5-e760-4603-9236-fcf2e8580a7c' and aggregate_type = 'FooRecord' and
document ilike '%user.n...@gmail.com%';

Resulting in the following plan:

   Bitmap Heap Scan on search_records  (cost=2184.00..2188.02 rows=1 width=104) (actual time=4332.007..4332.008 rows=1 loops=1)
     Recheck Cond: ((((organization_id)::character varying)::text = '975097c5-e760-4603-9236-fcf2e8580a7c'::text) AND ((aggregate_type)::text = 'FooRecord'::text) AND (document ~~* '%user.n...@gmail.com%'::text))
     Heap Blocks: exact=1
     Buffers: shared hit=23920 read=9752
     I/O Timings: read=4017.360
     ->  Bitmap Index Scan on search_records_btree_gin  (cost=0.00..2184.00 rows=1 width=0) (actual time=4331.987..4331.987 rows=1 loops=1)
           Index Cond: ((((organization_id)::character varying)::text = '975097c5-e760-4603-9236-fcf2e8580a7c'::text) AND ((aggregate_type)::text = 'FooRecord'::text) AND (document ~~* '%user.n...@gmail.com%'::text))
           Buffers: shared hit=23920 read=9751
           I/O Timings: read=4017.355
   Planning Time: 0.268 ms
   Execution Time: 4332.030 ms
  (11 rows)

We are running on Postgres RDS, engine version 12.8, with 32 GB memory and
8 GB of shared_buffers. We have 442 GB of 2000 GB disk space left.

Sometimes we also see queries for this particular customer that take more
than 20 seconds. The content of the document column in those cases is
similar to:

   User Name Kees postgresstreet Amsterdam 1000 AA user.n...@gmail.com 1234

Are we doing something wrong? I find the I/O timings quite high; does this
mean that it took roughly 4000 ms to read the 9752 blocks from disk?
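
A back-of-the-envelope check of that number, assuming the default 8 kB
block size:

  -- 9752 blocks * 8 kB ≈ 76 MB read in ~4.0 s, i.e. roughly 0.4 ms per block.
  SELECT 9752 * 8192 / 1024.0 / 1024.0 AS mb_read,
         round(4017.355 / 9752, 3)     AS ms_per_block;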

Any other tips and/or suggestions are welcome.

Kind regards,
Lars Vonk