Slow shutdowns sometimes on RDS Postgres

2018-09-13 Thread Chris Williams
Hi,

I'm using AWS RDS Postgres (9.6.6) and have run into very slow shutdowns
(10+ minutes) a few times when making database modifications (e.g. reboot,
changing instance size, etc.).  Other times, it shuts down quickly (1
minute or so).  I have not been able to figure out why sometimes it takes a
long time to shutdown.

When it happens, I see a bunch of lines in the postgres log like the
following over and over (almost every second or two) during this 10 minute
shutdown period:
2018-09-12 06:37:01 UTC:XXX.XXX.XXX.XXX(19712):my_user@my_db
:[16495]:FATAL:
2018-09-12 06:37:01 UTC:localhost(31368):rdsadmin@rdsadmin:[16488]:FATAL:
the database system is shutting down

Once I start seeing these messages, I start manually shutting down all of
our applications that are connected to the db.  I'm not sure if shutting
down the apps fixes it or if there's some timeout on the RDS side, but it
seems like once I start doing this, the database finally shuts down.

When it takes this long to shut down, it ends up causing a lot more
downtime than I would like.  I've tried asking AWS's support why it takes
so long to shutdown sometimes, but they basically just told me that's "how
it works" and that I should try to shut down all of my connections ahead of
time before making database modifications.

We just have a few ruby on rails applications connected to the database,
and don't really have any long running or heavy queries and the db is under
very light load, so I don't understand why it takes so long to shutdown.
We do have a sizeable number of connections though (about 600) and there
are two replicas connected to it.  I also tried
setting idle_in_transaction_session_timeout to 300 seconds to see if that
would help, but it made no difference.

I was wondering if anyone else had seen this behavior on their RDS Postgres
instances or had any suggestions on how I could shorten the shutdown time?

Thanks,
Chris


Re: Slow shutdowns sometimes on RDS Postgres

2018-09-13 Thread Chris Williams
Yeah, I figured that.  Unfortunately, every time it happens, I open a
support ticket with them, but they always just tell me that this is normal
behavior for postgres.  Whether it's "normal" or not, I really would like
to get my db to shut down faster, and their suggestion of manually shutting
down all of my apps ahead of time is a real headache.  Given that I haven't
gotten anywhere with their support, I figured I'd try asking on the mailing
list.

Thanks,
Chris

On Thu, Sep 13, 2018 at 3:17 PM Joshua D. Drake 
wrote:

> On 09/13/2018 03:04 PM, Chris Williams wrote:
> > Hi,
> >
> > I'm using AWS RDS Postgres (9.6.6) and have run into very slow
> > shutdowns (10+ minutes) a few times when making database modifications
> > (e.g. reboot, changing instance size, etc.).  Other times, it shuts
> > down quickly (1 minute or so).  I have not been able to figure out why
> > sometimes it takes a long time to shutdown.
>
> This is probably something you would have to talk to Amazon about. AWS
> RDS Postgres is a fork of PostgreSQL and not 100% compatible from an
> administrative perspective.
>
> JD
>
>
>
> >
> > When it happens, I see a bunch of lines in the postgres log like the
> > following over and over (almost every second or two) during this 10
> > minute shutdown period:
> > 2018-09-12 06:37:01
> > UTC:XXX.XXX.XXX.XXX(19712):my_user@my_db:[16495]:FATAL:
> > 2018-09-12 06:37:01
> > UTC:localhost(31368):rdsadmin@rdsadmin:[16488]:FATAL: the database
> > system is shutting down
> >
> > Once I start seeing these messages, I start manually shutting down all
> > of our applications that are connected to the db.  I'm not sure if
> > shutting down the apps fixes it or if there's some timeout on the RDS
> > side, but it seems like once I start doing this, the database finally
> > shuts down.
> >
> > When it takes this long to shut down, it ends up causing a lot more
> > downtime than I would like.  I've tried asking AWS's support why it
> > takes so long to shutdown sometimes, but they basically just told me
> > that's "how it works" and that I should try to shut down all of my
> > connections ahead of time before making database modifications.
> >
> > We just have a few ruby on rails applications connected to the
> > database, and don't really have any long running or heavy queries and
> > the db is under very light load, so I don't understand why it takes so
> > long to shutdown.  We do have a sizeable number of connections though
> > (about 600) and there are two replicas connected to it.  I also tried
> > setting idle_in_transaction_session_timeout to 300 seconds to see if
> > that would help, but it made no difference.
> >
> > I was wondering if anyone else had seen this behavior on their RDS
> > Postgres instances or had any suggestions on how I could shorten the
> > shutdown time?
> >
> > Thanks,
> > Chris
> >
>
> --
> Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
> ***  A fault and talent of mine is to tell it exactly how it is.  ***
> PostgreSQL centered full stack support, consulting and development.
> Advocate: @amplifypostgres || Learn: https://postgresconf.org
> * Unless otherwise stated, opinions are my own.   *
>
>


Re: Slow shutdowns sometimes on RDS Postgres

2018-09-14 Thread Chris Williams
Hey Jeremy,

Thanks so much for your response. That's a great tip regarding enabling
enhanced monitoring.  Maybe it will give some insight into which particular
backends are causing the shutdown to hang.  One interesting data point when
this happened is that in cloudwatch, once the database started shutting
down, you can see the connection count drop from 600 to 4 immediately.
Then the graph shows a constant 4 connections for 10 minutes straight
before it finally shuts down and goes to 0.  I'm guessing the shutdown is
hanging because of one of these 4 connections.  Unfortunately, because the
database is shutting down, I'm not able to connect and get any info about
these connections, so enabling enhanced monitoring might give some more
clues.

My other question is, do you think shutting down my apps manually once I
noticed the shutdown was hanging had any effect on the total shutdown time?
It seems a bit coincidental that the database finally shut down after
exactly 10 minutes.  This makes me think that some internal timeout in
RDS's shutdown script was triggered and that shutting down my apps didn't
actually affect anything. I'd much rather just wait 10 minutes then
frantically try to shutdown all of my apps.  So I'd be curious to know if
you are able to look up what the timeout is and if you think the timeout is
what is actually causing it to finally shut down. While I'd much rather
have my database shutdown in a minute or two, at least there would be some
comfort in knowing that 10 minutes is the upper bound.

I'd love to be able to figure out how to reproduce it, but it doesn't
happen consistently unfortunately and I've only ever seen it on our main
production database so that makes things trickier.  We do need to resize
our database again in a few weeks.  If there's any special debug settings i
should enable before modifying it (besides enhanced monitoring) or if
support wanted to observe it in the middle of the shutdown, let me know.
The last two times I've resized the database, the shutdown has hung like
this, so there's a good chance it will do it again.

Another idea I had too was to remove all the security group rules on the db
right before applying the modification.  My thinking was that maybe that
might help postgres terminate all the connections quicker? That said, it's
all a shot in the dark I guess since we don't know the underlying cause.

Best,
Chris

On Fri, Sep 14, 2018 at 8:43 AM Jeremy Schneider 
wrote:

> Hi Chris - this is an interesting one that we do see from time to time;
> seems worth responding here as actually our best understanding right now
> is that this is something in community code, not AWS-specific.
>
>
> On 9/13/18 15:17, Joshua D. Drake wrote:
> > This is probably something you would have to talk to Amazon about. AWS
> > RDS Postgres is a fork of PostgreSQL and not 100% compatible from an
> > administrative perspective.
>
> Hey JD! FWIW, the code differences are pretty minimal and generally just
> what's required to have a managed service where people can still use the
> database as they normally would. The biggest difference is just getting
> used to operating without direct OS access, and working through
> automation/tooling instead. (And there's always EC2 for the many
> customers who want/need superuser but still don't want to maintain the
> hardware.)
>
>
> On 9/13/18 16:10, Adrian Klaver wrote:
> > The thing is, what you are doing ("(e.g. reboot, changing instance
> > size, etc.)") are instance operations not database operations. That
> > comes under AWS's purview.
>
> Correct, managing reboots and hardware reconfigurations would be the
> responsibility of AWS. However Chris' issue here is just that PostgreSQL
> itself took a long time to shut down. I'm not aware of anything
> RDS-specific with this.
>
>
> > On 09/13/2018 03:04 PM, Chris Williams wrote:
> >> I'm using AWS RDS Postgres (9.6.6) and have run into very slow
> >> shutdowns (10+ minutes) a few times when making database modifications
> >> (e.g. reboot, changing instance size, etc.).  Other times, it shuts
> >> down quickly (1 minute or so).  I have not been able to figure out why
> >> sometimes it takes a long time to shutdown.
>
> I don't know about this specific incident, but I do know that the RDS
> team has seen cases where a backend gets into a state (like a system
> call) where it's not checking signals and thus doesn't receive or
> process the postmaster's request to quit. We've seen these processes
> delay shutdowns and also block recovery on streaming replicas.
>
>
> >> Once I start seeing these messages, I start manually shutting down all
> >> of 

Unsynchronized parallel dumps from 13.3 replica produced by pg_dump

2021-10-18 Thread Chris Williams
Hi,

We have a script that runs a pg_dump off of an RDS PG13.3 replica several
times per day. We then load this dump using pg_restore into another
postgres RDS db in another AWS account, scrub some of the data, and then
take a snapshot of it. We then use these snapshots for testing and staging
servers. This script has been running fine for years without any issues.
However, a little less than a month ago, we changed the pg_dump command in
the script to use the "-j" flag to speed up our dumps. It was my
understanding that starting with postgres v10, this option should now be
supported on replicas. Since we recently upgraded to Postgres 13.3 from
9.6, we thought it would be a good idea to start using this feature to
speed up our pg_dumps. In order to use parallel dumps, we made two changes
to our script: 1) We swapped to this command: pg_dump -Fd -j9 -Z0 -x
-hreplica -Uuser -f dump_dir For reference, the previous command (which
never had any problems) was: pg_dump -Fc -Z0 -x -hreplica -Uuser > dump 2)
We added code to resize our replica db to a larger hardware size before
starting the pg_dump so we can utilize more parallelism. After resizing the
db, the script waits for our replica to go back to the "available" state
before running the pg_dump command. When the pg_dump finishes, we then
resize the db back down to a t3.medium for cost reasons. When we first
rolled this out, the speed improvement was fantastic, it brought our
pg_dump down from 30 minutes to 5. However, within a week or so of rolling
this out, we've noticed that sometimes the dumps produced by our pg_dump
command were not consistent/synchronized. Sometimes when we would restore
the backup using pg_restore to our intermediate db it would throw duplicate
key errors or other types of constraint errors. Sometimes, even when the
pg_restore fully restored without errors, we noticed that some of the
sequences were wrong (we saw some sequences with a value less than the max
id in the column, so new inserts would start failing on the table). The
errors were never consistent either. It appears to me that the dump
produced by our pg_dump is not actually synchronized. I checked the
postgres logs on the server and don't see anything noteworthy. We also are
not getting any errors from our pg_dump command. Everything we are using is
version 13.3 (all of the dbs involved and the server running the script is
using v13.3 of pg_dump/pg_restore commands) I was curious if anyone had any
thoughts on what could be causing this. From my reading of the pg_dump
manual, it seems like it should always be producing a consistent dump, but
please correct me if I'm misunderstanding. I've also opened a ticket with
AWS's support to see if they have any ideas. For now, we've had to go back
to using the single threaded pg_dump (which is disappointing because it's
25 minutes slower). We were really hoping to take advantage of the
parallelism.

Any ideas would be much appreciated.

Thanks,
Chris


Re: Unsynchronized parallel dumps from 13.3 replica produced by pg_dump

2021-10-18 Thread Chris Williams
Thanks Tom.  It's a strange one for sure.  Hopefully AWS support will shed
some light on it. I will clarify too that this is the regular RDS Postgres
version and not their other Aurora Postgres service.  I suspect the Aurora
Postgres probably differs from the community version by quite a bit, but
I'm unsure how much their regular Postgres offering differs, if at all.

Thanks,
Chris

On Mon, Oct 18, 2021 at 8:05 PM Tom Lane  wrote:

> Chris Williams  writes:
> > We have a script that runs a pg_dump off of an RDS PG13.3 replica several
> > times per day. We then load this dump using pg_restore into another
> > postgres RDS db in another AWS account, scrub some of the data, and then
> > take a snapshot of it.
>
> Hmm ... I'm fairly sure that RDS Postgres is not Postgres at this level
> of detail.  The info I've been able to find about their replication
> mechanism talks about things like "eventually consistent reads", which
> is not something community Postgres deals in.
>
> In particular, what I'd expect from the community code is that a replica
> could see a sequence as being *ahead* of the value that you might expect
> from looking at related tables; but never behind.  (Also, that statement
> is true regardless of whether you are doing parallel dump.)  And
> non-sequence tables should always be consistent, period.
>
> So I'm suspicious that this is an RDS-specific effect, and thus that
> you should consult Amazon support first.  If they say "no, it's Postgres
> all the way down", then we need to look closer.
>
> regards, tom lane
>