Logical replication DNS cache
I've got a server at example.com that currently publishes logical replication to a server in AWS RDS. I plan to move the server at example.com so that it has a new IP address (but the same domain name). I'm curious if anybody knows how the logical replication subscriber in AWS would handle that.

There are at least three layers where the DNS might be cached, creating breakage once the move is complete:

- Postgres itself
- AWS's PostgreSQL fork in RDS might have something
- The OS underlying Amazon's RDS service

I expect this is a tough question unless somebody has done this before, but any ideas on how PostgreSQL would handle this kind of thing? Or is there a way to flush the DNS cache that PostgreSQL (or RDS or the OS) has?

I'm just beginning to explore this, but if anybody has experience, I'd love to hear it.

Thanks,

Mike
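[Editor's note: to my understanding, libpq resolves the host name on each connection attempt rather than caching it, so the Postgres layer should pick up the new address once the apply worker reconnects. A rough sketch of forcing that reconnect on the subscriber; the subscription name and connection string are placeholders, and RDS/OS-level caching is a separate question:]

```sql
-- On the RDS subscriber: force the apply worker to drop its connection
-- and reconnect, which triggers a fresh DNS lookup in libpq.
ALTER SUBSCRIPTION my_sub DISABLE;
ALTER SUBSCRIPTION my_sub ENABLE;

-- If the connection string itself ever needs updating:
ALTER SUBSCRIPTION my_sub
    CONNECTION 'host=example.com port=5432 dbname=mydb user=replicator';
```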
How to shorten a chain of logically replicated servers
Hi, I'm trying to figure out how to shorten a chain of logically replicating servers. Right now we have three servers replicating like so:

A --> B --> C

And I'd like to remove B from the chain of replication so that I only have:

A --> C

Of course, doing this without losing data is the goal. If the replication to C breaks temporarily, that's fine, so long as all the changes on A make it to C eventually.

I'm not sure how to proceed with this. My best theory is:

1. In a transaction, DISABLE the replication from A to B and start a new PUBLICATION on A that C will subscribe to in step ③ below. The hope is that this will simultaneously stop sending changes to B while starting a log of new changes that can later be sent to C.

2. Let any changes queued on B flush to C. (How to know when they're all flushed?)

3. Subscribe C to the new PUBLICATION created in step ①. Create the subscription with copy_data=False. This should send all changes to C that hadn't been sent to B, without sending the complete tables.

4. DROP all replication to/from B (this is just cleanup; the incoming changes to B were disabled in step ①, and outgoing changes from B were flushed in step ②).

Does this sound even close to the right approach? Logical replication can be a bit finicky, so I'd love to have some validation of the general approach before I go down this road.

Thanks everybody and happy new year,

Mike
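[Editor's note: a sketch of what the proposed steps would look like in SQL, with hypothetical publication/subscription names. Note one wrinkle the plan glosses over: the DISABLE in step 1 runs on B while the CREATE PUBLICATION runs on A, so the two cannot actually share a transaction:]

```sql
-- Step 1, on A: open a new publication for C.
CREATE PUBLICATION pub_a_to_c FOR ALL TABLES;

-- Step 1, on B: stop consuming changes from A.
ALTER SUBSCRIPTION sub_a_to_b DISABLE;

-- Step 3, on C: pick up new changes without re-copying the tables.
CREATE SUBSCRIPTION sub_a_to_c
    CONNECTION 'host=a.example.com dbname=mydb user=replicator'
    PUBLICATION pub_a_to_c
    WITH (copy_data = false);
```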
Re: How to shorten a chain of logically replicated servers
Hi, I don't usually like to bump messages on this list, but since I sent mine on New Year's Eve, I figured I'd better. Anybody have any ideas about how to accomplish this? I'm pretty stumped (as you can probably see). On Tue, Dec 31, 2019 at 3:51 PM Mike Lissner wrote: > > Hi, I'm trying to figure out how to shorten a chain of logically > replicating servers. Right now we have three servers replicating like > so: > > A --> B --> C > > And I'd like to remove B from the chain of replication so that I only have: > > A --> C > > Of course, doing this without losing data is the goal. If the > replication to C breaks temporarily, that's fine, so long as all the > changes on A make it to C eventually. > > I'm not sure how to proceed with this. My best theory is: > > 1. In a transaction, DISABLE the replication from A to B and start a > new PUBLICATION on A that C will subscribe to in step ③ below. The > hope is that this will simultaneously stop sending changes to B while > starting a log of new changes that can later be sent to C. > > 2. Let any changes queued on B flush to C. (How to know when they're > all flushed?) > > 3. Subscribe C to the new PUBLICATION created in step ①. Create the > subscription with copy_data=False. This should send all changes to C > that hadn't been sent to B, without sending the complete tables. > > 4. DROP all replication to/from B (this is just cleanup; the incoming > changes to B were disabled in step ①, and outgoing changes from B were > flushed in step ②). > > Does this sound even close to the right approach? Logical replication > can be a bit finicky, so I'd love to have some validation of the > general approach before I go down this road. > > Thanks everybody and happy new year, > > Mike
Re: How to shorten a chain of logically replicated servers
> You'd have to suspend all data modification on A in that interval.

I know how to stop the DB completely, but I can't think of any obvious ways to make sure that it doesn't get any data modification for a period of time. Is there a trick here? This is feeling a bit hopeless.

Thanks for the response, Laurenz.

On Tue, Jan 7, 2020 at 3:11 AM Laurenz Albe wrote:
>
> On Tue, 2019-12-31 at 15:51 -0800, Mike Lissner wrote:
> > Hi, I'm trying to figure out how to shorten a chain of logically
> > replicating servers. Right now we have three servers replicating like
> > so:
> >
> > A --> B --> C
> >
> > And I'd like to remove B from the chain of replication so that I only have:
> >
> > A --> C
> >
> > Of course, doing this without losing data is the goal. If the
> > replication to C breaks temporarily, that's fine, so long as all the
> > changes on A make it to C eventually.
> >
> > I'm not sure how to proceed with this. My best theory is:
> >
> > 1. In a transaction, DISABLE the replication from A to B and start a
> > new PUBLICATION on A that C will subscribe to in step ③ below. The
> > hope is that this will simultaneously stop sending changes to B while
> > starting a log of new changes that can later be sent to C.
> >
> > 2. Let any changes queued on B flush to C. (How to know when they're
> > all flushed?)
> >
> > 3. Subscribe C to the new PUBLICATION created in step ①. Create the
> > subscription with copy_data=False. This should send all changes to C
> > that hadn't been sent to B, without sending the complete tables.
> >
> > 4. DROP all replication to/from B (this is just cleanup; the incoming
> > changes to B were disabled in step ①, and outgoing changes from B were
> > flushed in step ②).
> >
> > Does this sound even close to the right approach? Logical replication
> > can be a bit finicky, so I'd love to have some validation of the
> > general approach before I go down this road.
>
> I don't think that will work.
> Any changes on A that take place between step 1 and step 3 wouldn't be
> replicated to C.
>
> You'd have to suspend all data modification on A in that interval.
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
Re: How to shorten a chain of logically replicated servers
That's a good trick, thanks again for the help. Boy, this promises to be a dumb process! I'm unqualified to guess at what might make this easier, but it does seem like something that should have some kind of low-level tools that could do the job. On Wed, Jan 8, 2020 at 1:53 AM Laurenz Albe wrote: > > On Tue, 2020-01-07 at 23:17 -0800, Mike Lissner wrote: > > > You'd have to suspend all data modification on A in that interval. > > > > I know how to stop the DB completely, but I can't think of any obvious > > ways to make sure that it doesn't get any data modification for a > > period of time. Is there a trick here? This is feeling a bit hopeless. > > The simplest solution would be to stop the applications that use PostgreSQL. > > You could block client connections using a "pg_hba.conf" entry > (and kill the established connections). > > Another option can be to set "default_transaction_read_only = on", > but that will only work if the clients don't override it explicitly. > > Yours, > Laurenz Albe > -- > Cybertec | https://www.cybertec-postgresql.com >
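[Editor's note: the two suggestions quoted above can be sketched as follows; the default_transaction_read_only route works only if clients never explicitly open read-write transactions:]

```sql
-- Option 1: make new transactions read-only by default.
ALTER SYSTEM SET default_transaction_read_only = on;
SELECT pg_reload_conf();

-- Option 2: after editing pg_hba.conf to reject application hosts,
-- terminate the connections that are already established.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND pid <> pg_backend_pid();
```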
Is it safe to transfer logical replication publication/subscription?
Hi all, this is a follow up from an earlier question I asked about shortening a chain of logically replicating servers. Right now we have three servers replicating like this:

A --> B --> C

And we want to remove B so that we have:

A --> C

Is it possible to DROP the subscription on B to A and then to SUBSCRIBE C to the previously used publication on A without losing data? Assume the following:

- "A" has a PUBLICATION named "A-to-B-Pub" that "B" subscribes to.
- "C" subscribes to "B" with a subscription named "B-to-C-Sub".

Would this work?

1. On B, DROP the subscription to "A-to-B-Pub".

2. Let any cached changes on B flush to C. Give it an hour to be sure.

3. On C, ALTER "B-to-C-Sub" to subscribe to the now-unused "A-to-B-Pub" on A.

Seems like this would either work perfectly or totally fail. Any ideas?

Thanks for any help,

Mike
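[Editor's note: rather than waiting an hour in step 2, the flush state can be checked directly. A sketch, assuming the replication slot on B that serves C is named after the subscription; C has caught up when the two LSNs match and no new writes are arriving:]

```sql
-- On B: compare how far C has consumed (confirmed_flush_lsn)
-- with B's current WAL write position.
SELECT slot_name, confirmed_flush_lsn, pg_current_wal_lsn()
FROM pg_replication_slots
WHERE slot_name = 'b_to_c_sub';
```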
Re: Is it safe to transfer logical replication publication/subscription?
That's a great point, thanks. The DROP SUBSCRIPTION notes say you can: > Disassociate the subscription from the replication slot by executing ALTER > SUBSCRIPTION ... SET (slot_name = NONE). After that, DROP SUBSCRIPTION will > no longer attempt any actions on a remote host. I'll read some more about the replication slots themselves (I did read about them a while back), but doing the above seems like a good way to break B from A, before resubscribing C to A instead? I feel like this is getting warmer. Thanks for the reply. I really appreciate it. Mike On Wed, Jan 8, 2020 at 2:46 PM Peter Eisentraut wrote: > > On 2020-01-08 22:22, Mike Lissner wrote: > > Hi all, this is a follow up from an earlier question I asked about > > shortening a chain of logically replicating servers. Right now we have > > three servers replicating like this: > > > > A --> B --> C > > > > And we want to remove B so that we have: > > > > A --> C > > > > Is it possible to DROP the subscription on B to A and then to > > SUBSCRIBE C to the previously used publication on A without losing > > data? > > What you are not taking into account here are replication slots, which > are the low-level mechanism to keep track of what a replication client > has consumed. When you drop the subscription on B, that (by default) > also drops the associated replication slot on A, and therefore you lose > the information of how much B has consumed from A. (This assumes that > there is concurrent write activity on A, otherwise this is uninteresting.) > > What you need to do instead is disassociate the B-from-A subscription > from the replication slot on A, then let all changes from B trickle to > C, then change the C-from-B subscription to replicate from A and use the > existing replication slot on A. > > See > https://www.postgresql.org/docs/current/logical-replication-subscription.html#LOGICAL-REPLICATION-SUBSCRIPTION-SLOT > for details. 
> > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Is it safe to transfer logical replication publication/subscription?
Thank you Peter, this is wildly helpful. On Thu, Jan 9, 2020 at 7:52 AM Peter Eisentraut wrote: > > On 2020-01-08 23:55, Mike Lissner wrote: > > That's a great point, thanks. The DROP SUBSCRIPTION notes say you can: > > > >> Disassociate the subscription from the replication slot by executing ALTER > >> SUBSCRIPTION ... SET (slot_name = NONE). After that, DROP SUBSCRIPTION > >> will no longer attempt any actions on a remote host. > > > > I'll read some more about the replication slots themselves (I did read > > about them a while back), but doing the above seems like a good way to > > break B from A, before resubscribing C to A instead? > > Yes, that's the one you want. > > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
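[Editor's note: the procedure Peter describes might look like the following; all subscription, publication, and slot names here are hypothetical, and the slot name on A must match whatever the B-from-A subscription was actually using (visible in pg_replication_slots on A):]

```sql
-- On B: detach the subscription from its slot on A, then drop it,
-- leaving the slot (and its consumed-position bookkeeping) on A intact.
ALTER SUBSCRIPTION sub_a_to_b DISABLE;
ALTER SUBSCRIPTION sub_a_to_b SET (slot_name = NONE);
DROP SUBSCRIPTION sub_a_to_b;

-- (wait for B's remaining queued changes to flush to C)

-- On C: repoint the existing subscription at A, reusing A's slot.
ALTER SUBSCRIPTION sub_b_to_c DISABLE;
ALTER SUBSCRIPTION sub_b_to_c
    CONNECTION 'host=a.example.com dbname=mydb user=replicator';
ALTER SUBSCRIPTION sub_b_to_c SET PUBLICATION a_to_b_pub WITH (refresh = false);
ALTER SUBSCRIPTION sub_b_to_c SET (slot_name = 'sub_a_to_b');
ALTER SUBSCRIPTION sub_b_to_c ENABLE;
```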
Does converting an indexed varchar to text rewrite its index? Docs say so, tests say no.
I think the docs say that if you convert a varchar to text, it'll rewrite the index, but my test doesn't seem to indicate that. Is the test or the documentation wrong?

If it's the docs, I'll be happy to make a fix, my first contribution to postgresql. :)

Here are the docs (https://www.postgresql.org/docs/10/sql-altertable.html):

> [...] changing the type of an existing column will require the entire table and its indexes to be rewritten. As an exception when changing the type of an existing column, if the USING clause does not change the column contents and the old type is either binary coercible to the new type or an unconstrained domain over the new type, a table rewrite is not needed; but *any indexes on the affected columns must still be rebuilt.*

And the test:

postgres=# CREATE TABLE t1 (id serial PRIMARY KEY, name character varying(30));
CREATE TABLE
Time: 25.927 ms
postgres=# INSERT INTO t1 (id) SELECT generate_series(1,100) i;
INSERT 0 100
Time: 2080.416 ms (00:02.080)
postgres=# CREATE INDEX ON t1 (name);
CREATE INDEX
Time: 463.373 ms  *<-- Index takes ~500ms to build*
postgres=# ALTER TABLE t1 ALTER COLUMN name TYPE text;
ALTER TABLE
Time: 19.698 ms  *<-- Alter takes 20ms to run (no rebuild, right?)*

Thanks!

Mike
Re: Does converting an indexed varchar to text rewrite its index? Docs say so, tests say no.
Thanks Adrian. Is there a reason that the index rebuild is nearly instant during the ALTER command as opposed to when you build it from scratch? Does it have to do with why this is called a "toast" index? DEBUG: building index "pg_toast_37609_index" on table "pg_toast_37609" Thanks for the feedback. I really appreciate it and it's super interesting to learn about. Mike On Thu, Jan 23, 2020 at 9:54 AM Adrian Klaver wrote: > On 1/23/20 8:55 AM, Mike Lissner wrote: > > I think the docs say that if you convert a varchar to text, it'll > > rewrite the index, but my test doesn't seem to indicate that. Is the > > test or the documentation wrong? > > > > If the docs, I'll be happy to make a fix my first contribution to > > postgresql. :) > > > > Here are the docs: > > > > (https://www.postgresql.org/docs/10/sql-altertable.html) > > > > > [...] changing the type of an existing column will require the entire > > table and its indexes to be rewritten. As an exception when changing the > > type of an existing column, if the USING clause does not change the > > column contents and the old type is either binary coercible to the new > > type or an unconstrained domain over the new type, a table rewrite is > > not needed; but *any indexes on the affected columns must still be > rebuilt.* > > > > And the test: > > > > postgres=# CREATE TABLE t1 (id serial PRIMARY KEY, name character > > varying(30)); > > CREATE TABLE > > Time: 25.927 ms > > postgres=# INSERT INTO t1 (id) SELECT generate_series(1,100) i; > > INSERT 0 100 > > Time: 2080.416 ms (00:02.080) > > postgres=# CREATE INDEX ON t1 (name); > > CREATE INDEX > > Time: 463.373 ms *<-- Index takes ~500ms to build* > > postgres=# ALTER TABLE t1 ALTER COLUMN name TYPE text; > > ALTER TABLE > > Time: 19.698 ms *<-- Alter takes 20ms to run (no rebuild, right?)* > > > I going to say it is the exception to the exception, in that in Postgres > varchar and text are essentially the same type. 
> > FYI there is a reindex going on: > > test=> set client_min_messages = debug1; > test=> CREATE TABLE t1 (id serial PRIMARY KEY, name character > varying(30)); > LOG: statement: CREATE TABLE t1 (id serial PRIMARY KEY, name character > varying(30)); > DEBUG: CREATE TABLE will create implicit sequence "t1_id_seq" for > serial column "t1.id" > DEBUG: CREATE TABLE / PRIMARY KEY will create implicit index "t1_pkey" > for table "t1" > DEBUG: building index "t1_pkey" on table "t1" serially > CREATE TABLE > test=> INSERT INTO t1 (id) SELECT generate_series(1,100) i; > LOG: statement: INSERT INTO t1 (id) SELECT generate_series(1,100) i; > INSERT 0 100 > test=> CREATE INDEX ON t1 (name); > LOG: statement: CREATE INDEX ON t1 (name); > DEBUG: building index "t1_name_idx" on table "t1" with request for 1 > parallel worker > CREATE INDEX > test=> ALTER TABLE t1 ALTER COLUMN name TYPE text; > LOG: statement: ALTER TABLE t1 ALTER COLUMN name TYPE text; > DEBUG: building index "pg_toast_37609_index" on table "pg_toast_37609" > serially > ALTER TABLE > > > > > Thanks! > > > > Mike > > ** > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
Re: Does converting an indexed varchar to text rewrite its index? Docs say so, tests say no.
You wrote: > Well it did not rebuilt the index("t1_name_idx") you created on name. OK, so then the docs *are* wrong? They say that: > any indexes on the affected columns must still be rebuilt. But that doesn't happen? Sorry to be persistent. I'm just a bit confused here. On Thu, Jan 23, 2020 at 11:28 AM Adrian Klaver wrote: > On 1/23/20 11:17 AM, Mike Lissner wrote: > > Thanks Adrian. Is there a reason that the index rebuild is nearly > > instant during the ALTER command as opposed to when you build it from > > scratch? > > Well it did not rebuilt the index("t1_name_idx") you created on name. > > > > > Does it have to do with why this is called a "toast" index? > > Certain data types(those that have varlena) can have portions of their > data stored in an auxiliary table in a compressed(or not) form. For all > the details see: > > https://www.postgresql.org/docs/12/storage-toast.html > > The index is the one on this auxiliary table. > > > > > DEBUG: building index "pg_toast_37609_index" on table "pg_toast_37609" > > > > Thanks for the feedback. I really appreciate it and it's super > > interesting to learn about. > > > > Mike > > > > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
Better documentation for schema changes in logical replication
Hi all, I've been using logical replication for about a year now, and I wonder if there's any sense that it needs better documentation of schema changes. My experience is that there's almost no documentation and that there are lots of opportunities to really screw things up.

It seems like starting somewhere would be good. I'd propose an outline something like the following:

1. General rules of replication schema changes (do we ALTER the SUBSCRIPTION to DISABLE it first?) What types of best practices do we have?

2. How to do basic things like add/remove a column/table, etc.

3. Particular things that cause issues, like making a field NULLable. I'm sure there are a handful of these I haven't run into yet.

My current practice is to set up logical replication across two docker images and test things out before doing it in production, but every time I do so I learn something new, despite having carefully read the documentation. Here's an example of me trying to figure out how to DROP COLUMNs:

https://github.com/freelawproject/courtlistener/issues/1164

Is this something others think should be improved? I'm not sure I'm qualified, though I'm keeping a lot of notes about my tests, as above.

Mike
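[Editor's note: the "DISABLE it first?" question in point 1 above refers to a pattern like the following; this is a conservative sketch, not documented guidance, and the subscription/table names are placeholders:]

```sql
-- On the subscriber: pause apply before changing the schema.
ALTER SUBSCRIPTION my_sub DISABLE;

-- Apply the same DDL on both publisher and subscriber, e.g.:
ALTER TABLE some_table ADD COLUMN new_col integer;

-- Resume; the apply worker picks up where it left off.
ALTER SUBSCRIPTION my_sub ENABLE;

-- If tables were added to the publication, the subscriber also needs:
ALTER SUBSCRIPTION my_sub REFRESH PUBLICATION;
```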
Errors with schema migration and logical replication — expected?
Hi, first time poster. I just ran into a rather messy problem when doing a schema migration with logical replication. I'm not entirely sure what went wrong, why, or how to prevent it in the future. The migration I ran was pretty simple (though auto-generated by Django):

BEGIN;
ALTER TABLE "search_docketentry" ADD COLUMN "pacer_sequence_number" smallint NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "pacer_sequence_number" DROP DEFAULT;
ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number" varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number" DROP DEFAULT;
ALTER TABLE "search_docketentry" ALTER COLUMN "entry_number" DROP NOT NULL;
ALTER TABLE "search_recapdocument" ALTER COLUMN "document_number" SET DEFAULT '';
ALTER TABLE "search_recapdocument" ALTER COLUMN "document_number" DROP DEFAULT;
ALTER TABLE "search_docketentry" DROP CONSTRAINT "search_docketentry_docket_id_12fd448b9aa007ca_uniq";
CREATE INDEX "search_docketentry_recap_sequence_number_1c82e51988e2d89f_idx" ON "search_docketentry" ("recap_sequence_number", "entry_number");
CREATE INDEX "search_docketentry_eb19fcf7" ON "search_docketentry" ("pacer_sequence_number");
CREATE INDEX "search_docketentry_bff4d47b" ON "search_docketentry" ("recap_sequence_number");
CREATE INDEX "search_docketentry_recap_sequence_number_d700f0391e8213a_like" ON "search_docketentry" ("recap_sequence_number" varchar_pattern_ops);
COMMIT;

BEGIN;
ALTER TABLE "search_docketentry" ALTER COLUMN "pacer_sequence_number" TYPE integer;
COMMIT;

And after running this migration, I started getting this error on the subscriber:

2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply worker for subscription "replicasubscription" has started
2018-12-09 05:59:45 UTC::@:[13373]:ERROR: null value in column "recap_sequence_number" violates not-null constraint
2018-12-09 05:59:45 UTC::@:[13373]:DETAIL: Failing row contains (48064261, 2018-12-07 04:48:40.388377+00, 2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null).
2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process: logical replication worker for subscription 18390 (PID 13373) exited with exit code 1

So, my migration created a new column with a not-null constraint, and somehow the subscriber got data that violated it. I don't know how that's possible, since this was a new column and it was never nullable.

I applied the above migration simultaneously on my publisher and subscriber, thinking that PostgreSQL was smart enough to do the right thing. I think the subscriber finished first (it has less traffic). The docs hint that PostgreSQL might be smart enough to not worry about the order you do migrations:

> *Logical replication is robust when schema definitions change in a live database:* When the schema is changed on the publisher and replicated data starts arriving at the subscriber but does not fit into the table schema, replication will error until the schema is updated.

And it even hints that doing a migration on the subscriber first is a good thing in some cases:

> In many cases, intermittent errors can be avoided by applying additive schema changes to the subscriber first.

But I'm now supremely skeptical that doing anything at the subscriber first is a good idea. Are the docs wrong? Does the above error make sense? Is the process for schema migrations documented somewhere beyond the above? I have lots of questions because I thought this would have gone smoother than it did.

As for the fix: I made the column nullable on the subscriber and I'm waiting for it to catch up. Once it does I'll re-sync its schema with the publisher. Anybody interested in following along with all this (or finding this later and having questions) can follow the issue here:

https://github.com/freelawproject/courtlistener/issues/919

Thank you for the lovely database! I hope this is helpful.

Mike
Re: Errors with schema migration and logical replication — expected?
> The above seems to be the crux of the problem, how did NULL get into the
> column data?

I agree. My queries are generated by Django (so I never write SQL myself), but:

- the column has always been NOT NULL for its entire lifetime
- we don't send *any* SQL commands to the replica yet, so that's not a factor (for now it's just a live backup)
- the publisher now has a NOT NULL constraint on that column. I never had to clear out null values to put it in place.

I assume that if that column ever had a null value and I tried to run a DDL to add a null constraint, the DDL would have failed, right?

Something feels wrong here, the more I think about it.
Re: Errors with schema migration and logical replication — expected?
On Sun, Dec 9, 2018 at 8:43 AM Adrian Klaver wrote: > On 12/9/18 8:03 AM, Mike Lissner wrote: > > > > The above seems to be the crux of the problem, how did NULL get into > > the > > column data? > > > > > > I agree. My queries are generated by Django (so I never write SQL > > myself), but: > > > > - the column has always been NOT NULL for its entire lifetime > > The lifetime being since the migration did this?: > > ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number" > varchar(50) DEFAULT '' NOT NULL; > ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number" > DROP DEFAULT; > "Lifetime" meaning that there was never a time when this column allowed nulls. > Also does the column recap_sequence_number appear in any other tables. > Just wondering if the error was on another table? > Good idea, but no. This column only exists in one table. > > - we don't send *any* SQL commands to the replica yet, so that's not a > > factor (for now it's just a live backup) > > - the publisher now has a NOT NULL constraint on that column. I never > > had to clear out null values to put it in place. I assume that if that > > This part confuses me. You seem to imply that the column existed before > the migration and you just added a NOT NULL constraint. The migration > shows the column being created with a NOT NULL constraint. > Sorry, what I mean is that if *somehow* the master had null values in that column at some point, which I don't know how would even be possible because it only came into existence with the command above — if somehow that happened, I'd know, because I wouldn't have been *able* to add a NULL constraint without first fixing the data in that column, which I never did. My contention is that for all these reasons, there should *never* have been a null value in that column on master. > > > column ever had a null value and I tried to run a DDL to add a null > > constraint, the DDL would have failed, right? 
> > > > Something feels wrong here, the more I think about it. > > A start would be to figure out what generated?: > > failing row contains (48064261, 2018-12-07 04:48:40.388377+00, > 2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null) > Yes, I completely agree. I can't think of any way that that should have ever been created. > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
Re: Errors with schema migration and logical replication — expected?
On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver wrote: > > 1) Using psql have you verified that NOT NULL is set on that column on > the publisher? > Yes, on the publisher and the subscriber. That was my first step when I saw the log lines about this. 2) And that the row that failed in the subscriber is in the publisher table. > Yep, it's there (though it doesn't show a null for that column, and I don't know how it ever could have). > 3) That there are no NULL values in the publisher column? > This on the publisher: select * from search_docketentry where recap_sequence_number is null; returns zero rows, so yeah, no nulls in there (which makes sense since they're not allowed). Whatever the answers to 1), 2) and 3) are the next question is: > > 4) Do you want/need recap_sequence_number to be NOT NULL. > Yes, and indeed that's how it always has been. a) If not then you could leave things as they are. > Well, I was able to fix this by briefly allowing nulls on the subscriber, letting it catch up with the publisher, setting all nulls to empty strings (a Django convention), and then disallowing nulls again. After letting it catch up, there were 118 nulls on the subscriber in this column: https://github.com/freelawproject/courtlistener/issues/919#issuecomment-445520185 That shouldn't be possible since nulls were never allowed in this column on the publisher. > b) If so then you: > > 1) Have to figure out what is sending NULL values to the column. > > Maybe a model that has null=True set when it shouldn't be? > Nope, never had that. I'm 100% certain. > A Form/ModelForm that is allowing None/Null? > Even if that was the case, the error wouldn't have shown up on the subscriber since that null would have never been allowed in the publisher. But anyway, I don't use any forms with this column. > Some code that is operating outside the ORM e.g. doing a >direct query using from django.db import connection. 
> That's an idea, but like I said, nothing sends SQL to the subscriber (not even read requests), and this shouldn't have been possible in the publisher due to the NOT NULL constraint that has *always* been on that column. 2) Clean up the NULL values in the column in the subscriber > and/or publisher. > There were only NULL values in the subscriber, never in the publisher. Something is amiss here. I appreciate all the responses. I'm scared to say so, but I think this is a bug in logical replication. Somehow a null value appeared at the subscriber that was never in the publisher. I also still have this question/suggestion from my first email: > Is the process for schema migrations documented somewhere beyond the above? Thank you again, Mike
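[Editor's note: the workaround Mike describes later in the thread, reconstructed as a sketch from his description; run on the subscriber only, using the table and column names from the failing migration:]

```sql
-- On the subscriber: relax the constraint so the stuck apply worker
-- can make progress.
ALTER TABLE search_docketentry
    ALTER COLUMN recap_sequence_number DROP NOT NULL;

-- (wait for the subscription to catch up with the publisher)

-- Replace the unexpected NULLs with Django's empty-string convention,
-- then restore the constraint to match the publisher.
UPDATE search_docketentry
SET recap_sequence_number = ''
WHERE recap_sequence_number IS NULL;

ALTER TABLE search_docketentry
    ALTER COLUMN recap_sequence_number SET NOT NULL;
```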
Re: Errors with schema migration and logical replication — expected?
Reupping this since it was over the weekend and looks like a bug in logical replication. My problems are solved, but some very weird things happened when doing a schema migration. On Sun, Dec 9, 2018 at 5:48 PM Mike Lissner wrote: > On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver > wrote: > >> >> 1) Using psql have you verified that NOT NULL is set on that column on >> the publisher? >> > > Yes, on the publisher and the subscriber. That was my first step when I > saw the log lines about this. > > 2) And that the row that failed in the subscriber is in the publisher >> table. >> > > Yep, it's there (though it doesn't show a null for that column, and I > don't know how it ever could have). > > >> 3) That there are no NULL values in the publisher column? >> > > This on the publisher: > > select * from search_docketentry where recap_sequence_number is null; > > returns zero rows, so yeah, no nulls in there (which makes sense since > they're not allowed). > > Whatever the answers to 1), 2) and 3) are the next question is: >> >> 4) Do you want/need recap_sequence_number to be NOT NULL. >> > > Yes, and indeed that's how it always has been. > > a) If not then you could leave things as they are. >> > > Well, I was able to fix this by briefly allowing nulls on the subscriber, > letting it catch up with the publisher, setting all nulls to empty strings > (a Django convention), and then disallowing nulls again. After letting it > catch up, there were 118 nulls on the subscriber in this column: > > > https://github.com/freelawproject/courtlistener/issues/919#issuecomment-445520185 > > That shouldn't be possible since nulls were never allowed in this column > on the publisher. > > >> b) If so then you: >> >> 1) Have to figure out what is sending NULL values to the column. >> >> Maybe a model that has null=True set when it shouldn't be? >> > > Nope, never had that. I'm 100% certain. > > >> A Form/ModelForm that is allowing None/Null? 
>> > > Even if that was the case, the error wouldn't have shown up on the > subscriber since that null would have never been allowed in the publisher. > But anyway, I don't use any forms with this column. > > >> Some code that is operating outside the ORM e.g. doing a >>direct query using from django.db import connection. >> > > That's an idea, but like I said, nothing sends SQL to the subscriber (not > even read requests), and this shouldn't have been possible in the publisher > due to the NOT NULL constraint that has *always* been on that column. > > 2) Clean up the NULL values in the column in the subscriber >> and/or publisher. >> > > There were only NULL values in the subscriber, never in the publisher. > Something is amiss here. > > I appreciate all the responses. I'm scared to say so, but I think this is > a bug in logical replication. Somehow a null value appeared at the > subscriber that was never in the publisher. > > I also still have this question/suggestion from my first email: > > > Is the process for schema migrations documented somewhere beyond the > above? > > Thank you again, > > Mike > >
Re: Errors with schema migration and logical replication — expected?
On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver wrote:
>
> > Well, I was able to fix this by briefly allowing nulls on the
> > subscriber, letting it catch up with the publisher, setting all
> > nulls to empty strings (a Django convention), and then disallowing
> > nulls again. After letting it catch up, there were 118 nulls on the
> > subscriber in this column:
>
> So recap_sequence_number is not actually a number, it is a code?

It has sequential values, but they're not necessarily numbers.

> > I appreciate all the responses. I'm scared to say so, but I think
> > this is a bug in logical replication. Somehow a null value appeared
> > at the subscriber that was never in the publisher.
> >
> > I also still have this question/suggestion from my first email:
> >
> > > Is the process for schema migrations documented somewhere beyond
> > > the above?
>
> Not that I know of. It might help, if possible, to detail the steps in
> the migration. Also what program you used to do it. Given that is Django
> I am assuming some combination of migrate, makemigrations and/or
> sqlmigrate.

Pretty simple/standard, I think:

- Changed the code.
- Generated the migration using manage.py makemigrations
- Generated the SQL using sqlmigrate
- Ran the migration using manage.py migrate on the master and using psql on the replica

Mike
Re: Errors with schema migration and logical replication — expected?
This sounds *very* plausible. So I think there are a few takeaways:

1. Should the docs mention that additive changes with NOT NULL constraints
are bad?

2. Is there a way this could work without completely breaking replication?
For example, should PostgreSQL realize replication can't work in this
instance and then stop it until schemas are back in sync, like it does with
other incompatible schema changes? That'd be better than failing in this
way and is what I'd expect to happen.

3. Are there other edge cases like this that aren't well documented that we
can expect to creep up on us? If so, should we try to spell out exactly
*which* additive changes *are* OK?

This feels like a major "gotcha" to me, and I'm trying to avoid those. I
feel like the docs are pretty lacking here and that others will find
themselves in similarly bad positions.

Better schema migration docs would surely help, too.

Mike

On Wed, Dec 12, 2018 at 7:11 AM Adrian Klaver wrote:

> On 12/12/18 12:15 AM, Mike Lissner wrote:
> > Pretty simple/standard, I think:
> > - Changed the code.
> > - Generated the migration using manage.py makemigrations
> > - Generated the SQL using sqlmigrate
> > - Ran the migration using manage.py migrate on the master and using
> >   psql on the replica
>
> The only thing I can think of involves this sequence on the subscriber:
>
> ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
> varchar(50) DEFAULT '' NOT NULL;
> ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
> DROP DEFAULT;
>
> and then this:
>
> https://www.postgresql.org/docs/11/logical-replication-subscription.html
>
> "Columns of a table are also matched by name. A different order of
> columns in the target table is allowed, but the column types have to
> match. The target table can have additional columns not provided by the
> published table. Those will be filled with their default values."
>
> https://www.postgresql.org/docs/10/sql-createtable.html
>
> "If there is no default for a column, then the default is null."
>
> So the subscriber finished the migration first, as alluded to in an
> earlier post. There is no data for recap_sequence_number coming from the
> provider, so Postgres place holds the data with NULL until such time as
> the migration on the provider finishes and actual data for
> recap_sequence_number starts flowing.
>
> Going forward, options I see:
>
> 1) Making sure there is a DEFAULT other than NULL for a NOT NULL column.
>
> 2) Stop the replication and do the migration scripts on both provider
> and subscriber until they both complete and then start replication again.
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
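If Adrian's theory is right, one variant of his option 1 is to keep the DEFAULT in place on the subscriber until the publisher's migration has finished, so that any rows replicated without the new column are filled with '' rather than NULL. A rough sketch, not something from the thread or a verified procedure:

```sql
-- Subscriber first: add the column but KEEP the default for now, so rows
-- arriving from the publisher without this column are filled with ''.
ALTER TABLE "search_docketentry"
    ADD COLUMN "recap_sequence_number" varchar(50) DEFAULT '' NOT NULL;

-- Publisher: run the full migration (ADD COLUMN ... DEFAULT '' NOT NULL,
-- then DROP DEFAULT) and wait for it to complete.

-- Subscriber, only after the publisher is done: now it is safe to drop
-- the default, matching the final schema.
ALTER TABLE "search_docketentry"
    ALTER COLUMN "recap_sequence_number" DROP DEFAULT;
```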
Re: Errors with schema migration and logical replication — expected?
Thanks Adrian for all the help. I filed this as bug #15549. I hope this
all helps get logical replication into the "Running" stage.

On Wed, Dec 12, 2018 at 5:06 PM Adrian Klaver wrote:

> On 12/12/18 3:19 PM, Mike Lissner wrote:
> > This sounds *very* plausible. So I think there are a few takeaways:
> >
> > 1. Should the docs mention that additive changes with NOT NULL
> > constraints are bad?
>
> It's not the NOT NULL, it's the lack of a DEFAULT. In general a column
> with a NOT NULL and no DEFAULT is going to bite you sooner or later :)
> At this point I have gathered enough of those bite marks to just make it
> my policy to always provide a DEFAULT for a NOT NULL column.
>
> > 2. Is there a way this could work without completely breaking
> > replication? For example, should PostgreSQL realize replication can't
> > work in this instance and then stop it until schemas are back in sync,
> > like it does with other incompatible schema changes? That'd be better
> > than failing in this way and is what I'd expect to happen.
>
> Not sure, as there is no requirement that a column has a specified
> DEFAULT. This is unlike PK and FK constraint violations where the
> relationship is spelled out. Trying to parse all the possible ways a
> user could get into trouble would require something on the order of an
> AI, and I don't see that happening anytime soon.
>
> > 3. Are there other edge cases like this that aren't well documented
> > that we can expect to creep up on us? If so, should we try to spell
> > out exactly *which* additive changes *are* OK?
>
> Not that I know of. By their nature edge cases are rare and often are
> dealt with in the moment and not pushed out to everybody. The only
> solution I know of is pretesting your schema change/replication setup
> on a dev installation.
>
> > This feels like a major "gotcha" to me, and I'm trying to avoid those.
> > I feel like the docs are pretty lacking here and that others will find
> > themselves in similarly bad positions.
>
> Logical replication in core (not the pglogical extension) appeared for
> the first time in version 10. On the crawl/walk/run spectrum it is
> moving from crawl to walk. The docs will take some time to be more
> complete. Just for the record, my previous post was sketching out a
> possible scenario, not an ironclad answer. If you think the answer is
> plausible and a 'gotcha', I would file a bug:
>
> https://www.postgresql.org/account/login/?next=/account/submitbug/
>
> > Better schema migration docs would surely help, too.
> >
> > Mike
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
Trying to understand a failed upgrade in AWS RDS
Hi all,

In AWS RDS, we are using logical replication between a PostgreSQL 14
publisher and a PostgreSQL 10 subscriber. The subscriber is rather old, so
yesterday I tried to update it using AWS's built-in upgrade tool (which
uses pg_upgrade behind the scenes).

I did a pretty thorough test run before beginning, but the live run went
pretty poorly. My process was:

1. Disable the subscription to pg10.
2. Run RDS's upgrade (which runs pg_upgrade).
3. Re-enable the subscription to the newly upgraded server.

The idea was that the publisher could still be live and collect changes,
and then on step 3, those changes would flush to the newly upgraded server.

When I hit step three, things went awry. From what I can tell, it seems
like pg_upgrade might have wiped out the LSN location of the subscriber,
because I was getting many messages in the logs saying:

2023-05-19 01:01:09 UTC:100.20.224.120(56536):django@courtlistener:[29669]:STATEMENT: CREATE_REPLICATION_SLOT "pg_18278_sync_86449755_7234675743763347169" LOGICAL pgoutput USE_SNAPSHOT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: replication slot "pg_18278_sync_16561_7234675743763347169" does not exist
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:STATEMENT: DROP_REPLICATION_SLOT pg_18278_sync_16561_7234675743763347169 WAIT
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:ERROR: all replication slots are in use
2023-05-19 01:01:09 UTC:100.20.224.120(56550):django@courtlistener:[29670]:HINT: Free one or increase max_replication_slots.

I followed those instructions, and upped max_replication_slots to 200.
That fixed that error, but then I had errors about COPY commands failing,
and looking in the publisher I saw about 150 slots like:

select * from pg_replication_slots;

                 slot_name                  |  plugin  | slot_type | datoid |   database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
--------------------------------------------+----------+-----------+--------+---------------+-----------+--------+------------+------+--------------+--------------+---------------------+------------+---------------+-----------
 pg_18278_sync_86449408_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | t      |       6906 |      |    859962500 | EA5/954A9F18 |                     | reserved   |               | f
 pg_18278_sync_20492279_7234675743763347169 | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EDF0 | EA5/9548EE28        | reserved   |               | f
 pg_18278_sync_16940_7234675743763347169    | pgoutput | logical   |  16428 | courtlistener | f         | f      |            |      |    859962448 | EA5/9548EE60 | EA5/9548EE98        | reserved   |               | f

So this looks like it's trying to sync all of the existing tables all over
again when I re-enabled the subscription. Does that make sense?

In the future, I'll DROP the subscription and then create a new one with
copy_data=False, but this was a real gotcha.

Anybody know what's going on here?

Thanks,

Mike
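For reference, the "drop and recreate without the initial copy" plan mentioned above looks roughly like this on the subscriber. The subscription, publication, and connection values here are placeholders, not the real ones:

```sql
-- By default, DROP SUBSCRIPTION also drops its slot on the publisher.
DROP SUBSCRIPTION my_sub;

CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher.example.com dbname=courtlistener user=replicator'
    PUBLICATION my_pub
    WITH (copy_data = false);  -- skip the initial table sync
```

One caveat: a fresh slot only captures changes made after it is created, so anything written on the publisher between the DROP and the CREATE would never reach the subscriber.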
Re: Trying to understand a failed upgrade in AWS RDS
I also am realizing belatedly that my solution of dropping the
subscription probably won't work anyway, since I'd lose the changes on the
publisher for the duration of the upgrade. Maybe I could drop the
subscription while keeping the slot on the publisher, and then create a
new subscription after the upgrade using that slot and copy_data=False?
Getting wonky.

On Fri, May 19, 2023 at 8:17 AM Mike Lissner wrote:

> Hi all,
>
> In AWS RDS, we are using logical replication between a postgresql 14
> publisher and a postgresql 10 subscriber. [...]
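The idea of keeping the slot across the upgrade is expressible with the subscription commands: a subscription can be dissociated from its slot before being dropped, and a new subscription can attach to the surviving slot without re-copying tables. A hedged sketch with placeholder names; I have not verified this works under RDS's upgrade tooling:

```sql
-- On the subscriber, before the upgrade: detach the slot, then drop the
-- subscription. The slot on the publisher survives and keeps retaining WAL.
ALTER SUBSCRIPTION my_sub DISABLE;
ALTER SUBSCRIPTION my_sub SET (slot_name = NONE);
DROP SUBSCRIPTION my_sub;

-- On the subscriber, after the upgrade: reattach to the preserved slot
-- and skip the initial copy, so only the queued changes are streamed.
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher.example.com dbname=courtlistener user=replicator'
    PUBLICATION my_pub
    WITH (slot_name = 'my_sub', create_slot = false, copy_data = false);
```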
Re: Trying to understand a failed upgrade in AWS RDS
Thanks for the suggestions. I think in the future I'll do something like
this rather than try to re-use existing subscriptions. I'm still trying to
understand what went wrong, though. Putting a finer point on my question:
does pg_upgrade mess up disabled subscriptions?

On Fri, May 19, 2023 at 1:55 PM Elterman, Michael wrote:

> Please, use the following runbook.
>
> 1. Disable the subscription to pg10.
> 2. Disable Application Users on Publisher.
> 3. Drop all replication slots on Publisher (the upgrade cannot be
>    executed if there are any replication slots).
> 4. Run RDS's upgrade (which runs pg_upgrade).
> 5. Recreate replication slots with the same names.
> 6. Enable Application Users on Publisher.
> 7. Re-enable the subscriptions to the newly upgraded server.
>
> Good luck
>
> On Fri, May 19, 2023 at 11:49 AM Mike Lissner <
> mliss...@michaeljaylissner.com> wrote:
>
>> I also am realizing belatedly that my solution of dropping the
>> subscription probably won't work anyway, since I'd lose the changes on
>> the publisher for the duration of the upgrade. [...]
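Steps 3 and 5 of that runbook would presumably use the replication slot functions on the publisher; a sketch, with the slot name as a placeholder. Note that a recreated slot starts streaming from the current WAL position, which is why step 2 (stopping application writes) matters: changes made while the slot does not exist are never captured.

```sql
-- Step 3: drop the slot on the publisher so the upgrade can proceed.
SELECT pg_drop_replication_slot('my_sub');

-- Step 5: recreate it with the same name and the pgoutput plugin.
SELECT pg_create_logical_replication_slot('my_sub', 'pgoutput');
```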
Re: Trying to understand a failed upgrade in AWS RDS
> As far as I know it's impossible to reliably pg_upgrade a node that has
> subscriptions and eventually resume logical replication.

Should this go in the documentation somewhere? Maybe in the pg_upgrade
notes? I still don't understand the mechanism. You also say that:

> It's possible to make it work with some efforts in some basic
> configurations and / or if no changes happen on the publications

But that kind of surprises me too, actually, because it seemed like
pg_upgrade wiped out the LSN locations of the subscriber, making it start
all over. Upgrading a subscriber seems like something that could/should
work, so it should be documented if pg_upgrade is incompatible with
maintaining a subscription, shouldn't it?
Local replication "slot does not exist" after initial sync
Hi, I set up logical replication a few days ago, but it's throwing some
weird log lines that have me worried. Does anybody have experience with
lines like the following on a subscriber:

LOG: logical replication table synchronization worker for subscription "compass_subscription", table "search_opinionscitedbyrecapdocument" has started
ERROR: could not start WAL streaming: ERROR: replication slot "pg_20031_sync_17418_7324846428853951375" does not exist
LOG: background worker "logical replication worker" (PID 1014) exited with exit code 1

Slots with this kind of name (pg_xyz_sync_*) are created during the
initial sync, but it seems like the subscription is working based on a
quick look in a few tables.

I thought this might be related to running out of slots on the publisher,
so I increased both max_replication_slots and max_wal_senders to 50 and
rebooted so those would take effect. No luck.

I thought rebooting the subscriber might help. No luck.

When I look in the publisher to see the slots we have...

SELECT * FROM pg_replication_slots;

...I do not see the one that's missing according to the log lines.

So it seems like the initial sync might have worked properly (tables have
content), but that I have an errant process on the subscriber that might
be stuck in a retry loop.

I haven't been able to fix this, and I think my last attempt might be a
new subscription with copy_data=false, but I'd rather avoid that if I can.

Is there a way to fix or understand this so that I don't get the log lines
forever and so that I can be confident the replication is in good shape?

Thank you!

Mike
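One way to see whether the initial sync actually finished is to check the per-table sync state on the subscriber; any table not yet in the "ready" state is still owned by a sync worker like the one in the log above. A sketch:

```sql
-- On the subscriber: sync state of each table in the subscription.
-- srsubstate is typically 'i' (initializing), 'd' (copying data),
-- 's' (synchronized), or 'r' (ready, i.e. streaming normally).
SELECT srrelid::regclass AS table_name,
       srsubstate,
       srsublsn
FROM pg_subscription_rel
ORDER BY srsubstate, table_name;
```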
Re: Local replication "slot does not exist" after initial sync
Sorry, two more little things here.

The publisher logs don't add much, but here's what we see:

STATEMENT: START_REPLICATION SLOT "pg_20031_sync_17418_7324846428853951375" LOGICAL F1D0/346C6508 (proto_version '2', publication_names '"compass_publication2"')
ERROR: replication slot "pg_20031_sync_17402_7324846428853951375" does not exist

And I thought that maybe there'd be some magic in the REFRESH command on
the subscriber, so I tried that:

alter subscription xyz refresh publication;

To nobody's surprise, that didn't help. :)

On Sun, Feb 25, 2024 at 10:00 AM Mike Lissner <
mliss...@michaeljaylissner.com> wrote:

> Hi, I set up logical replication a few days ago, but it's throwing some
> weird log lines that have me worried. [...]
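On the broader question of gaining confidence that replication is healthy, the subscriber's statistics view shows whether the main apply worker is running and how far it has received and applied; a sketch:

```sql
-- On the subscriber: one row per running subscription worker. A steadily
-- advancing received_lsn / latest_end_lsn suggests the apply side is fine,
-- even while a stray sync worker keeps erroring in the logs.
SELECT subname,
       pid,
       received_lsn,
       latest_end_lsn,
       latest_end_time
FROM pg_stat_subscription;
```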