What to do when dynamic shared memory control segment is corrupt

2018-06-18 Thread Sherrylyn Branchaw
Greetings,

We are using Postgres 9.6.8 (planning to upgrade to 9.6.9 soon) on RHEL 6.9.

We recently experienced two similar outages on two different prod
databases. The error messages from the logs were as follows:

LOG:  server process (PID 138529) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.

We are still investigating what may have triggered these errors, since
there were no recent changes to these databases. Unfortunately, core dumps
were not configured correctly, so we may have to wait for the next outage
before we can do a good root cause analysis.

My question, meanwhile, is around remedial actions to take when this
happens.

In one case, the logs recorded

LOG:  all server processes terminated; reinitializing
LOG:  incomplete data in "postmaster.pid": found only 1 newlines while
trying to add line 7
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 365/FDFA738
LOG:  invalid record length at 365/12420978: wanted 24, got 0
LOG:  redo done at 365/12420950
LOG:  last completed transaction was at log time 2018-06-05
10:59:27.049443-05
LOG:  checkpoint starting: end-of-recovery immediate
LOG:  checkpoint complete: wrote 5343 buffers (0.5%); 0 transaction log
file(s) added, 1 removed, 0 recycled; write=0.131 s, sync=0.009 s,
total=0.164 s; sync files=142, longest=0.005 s, average=0.000 s;
distance=39064 kB, estimate=39064 kB
LOG:  MultiXact member wraparound protections are now enabled
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections

In that case, the database restarted immediately, with only 30 seconds of
downtime.

In the other case, the logs recorded

LOG:  all server processes terminated; reinitializing
LOG:  dynamic shared memory control segment is corrupt
LOG:  incomplete data in "postmaster.pid": found only 1 newlines while
trying to add line 7

In that case, the database did not restart on its own. It was 5 am on
Sunday, so the on-call SRE just manually started the database up, and it
appears to have been running fine since.

My question is whether the corrupt shared memory control segment, and the
failure of Postgres to automatically restart, mean the database should not
be automatically started up, and if there's something we should be doing
before restarting.

Do we potentially have corrupt data or indices as a result of our last
outage? If so, what should we do to investigate?
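
The crudest check I can think of (a sketch only, and I don't know whether it
would be sufficient) is to force a full sequential read of every user table,
which should at least surface unreadable heap pages, though it validates
nothing about indexes:

DO $$
DECLARE
    rel text;
BEGIN
    FOR rel IN
        SELECT format('%I.%I', schemaname, tablename)
        FROM pg_tables
        WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
    LOOP
        -- A throwaway count(*) just to make every heap page get read.
        EXECUTE 'SELECT count(*) FROM ' || rel;
    END LOOP;
END $$;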

Thanks,
Sherrylyn


Re: What to do when dynamic shared memory control segment is corrupt

2018-06-18 Thread Sherrylyn Branchaw
> Hm ... were these installations built with --enable-cassert?  If not,
> an abort trap seems pretty odd.

The packages are installed directly from the yum repos for RHEL. I'm not
aware that --enable-cassert is being used, and we're certainly not
installing from source.

> Those "incomplete data" messages are quite unexpected and disturbing.
> I don't know of any mechanism within Postgres proper that would result
> in corruption of the postmaster.pid file that way.  (I wondered briefly
> if trying to start a conflicting postmaster would result in such a
> situation, but experimentation here says not.)  I'm suspicious that
> this may indicate a bug or unwarranted assumption in whatever scripts
> you use to start/stop the postmaster.  Whether that is at all related
> to your crash issue is hard to say, but it bears looking into.

We're using the stock initd script from the yum repo, but I dug into this
issue, and it looks like we're passing the path to the postmaster.pid as
the $pidfile variable in our sysconfig file, meaning the initd script is
managing the postmaster.pid file, and specifically is overwriting it with a
single line containing just the pid. I'm not sure why it's set up like
this, and I'm thinking we should change it, but it seems harmless and
unrelated to the crash. In particular, manual initd actions such as stop,
start, restart, and status all work fine.

> No, that looks like fairly typical crash recovery to me: corrupt shared
> memory contents are expected and recovered from after a crash.

That's reassuring. But if it's safe for us to immediately start the server
back up, why did Postgres not automatically start the server up like it did
the first time? I was assuming it was due to the presence of the corrupt
memory segment, as that was the only difference in the logs, although I
could be wrong. Automatic restart would have saved us a great deal of
downtime: in the first case we had total recovery within 30 seconds, while
in the second case we had many minutes of downtime while someone got paged,
troubleshot the issue, and eventually decided to try starting the database
back up.

At any rate, if it's safe, we can write a script to detect this failure
mode and automatically restart, although it would be less error-prone if
Postgres restarted automatically.

> Hm, I supposed that Sherrylyn would've noticed any PANIC entries in
> the log.

No PANICs. The log lines I pasted were the only ones that looked relevant
in the Postgres logs. I can try to dig through the application logs, but I
was planning to wait until the next time this happens, since we should have
core dumps fixed and that might make things easier.

> What extensions are installed, if any?

In the first database, the one without the corrupt memory segment and that
restarted automatically: plpgsql and postgres_fdw.

In the second database, the one where the memory segment got corrupted and
that didn't restart automatically: dblink, hstore, pg_trgm, pgstattuple,
plpgsql, and tablefunc.

I forgot to mention that the queries that got killed were innocuous-looking
SELECTs that completed successfully for me in less than a second when I ran
them manually. In other words, the problem was not reproducible.

Sherrylyn


Re: What to do when dynamic shared memory control segment is corrupt

2018-06-19 Thread Sherrylyn Branchaw
> Yeah, I'd like to know that too.  The complaint about corrupt shared
> memory may be just an unrelated red herring, or it might be a separate
> effect of whatever the primary failure was ... but I think it was likely
> not the direct cause of the failure-to-restart.
>
> Anyway, I would not be afraid to try restarting the postmaster manually
> if it died.  Maybe don't do that repeatedly without human intervention;
> but PG is pretty robust against crashes.  We developers crash it all the
> time, and we don't lose data.

Understood, and thanks. I was basing my concern on a message in the mailing
lists that suggested that postgres might fail to start up in the event of a
corrupted memory segment. I would link to the message directly, but I keep
getting backend server error messages when I try to search for it today. At
any rate, it looked like there was a chance that it was a deliberate design
choice, and I didn't want to ignore it if so. It's good to know that this
is not the case.

> I realize that you're most focused on less-downtime, but from my
> perspective it'd be good to worry about collecting evidence as to
> what happened exactly.

Absolutely. I would love to know why this is happening too. However, our
priorities have been set in part by a very tight deadline handed down from
the C-levels to migrate to Aurora, so we have to focus our energies
accordingly. I will be back with core files if this happens again before
we're completely migrated over. Meanwhile, thank you for assuring me we
have no current data corruption and that it's safe to restart next time
without taking additional action to avoid or detect corruption.

Best,
Sherrylyn


Slow WAL recovery for DROP TABLE

2018-07-17 Thread Sherrylyn Branchaw
We are running Postgres 9.6.9 on RHEL 6.9. We're using built-in streaming
replication, with a WAL archive for fallback purposes. No logical
replication.

We recently had a bug in our code accidentally create several hundred
thousand tables in a single database. A batch job started cleaning them up
one at a time (separate transactions), over the course of a couple nights.

But what we noticed the next day was that the 2 standbys to this database
were falling further and further behind. They were no longer streaming;
instead, they were recovering from the archive, and they were doing so very
slowly.

Only a small amount of WAL was generated by the DROP TABLEs, relative to
normal traffic, but the time to recover a single WAL file sometimes took
over a minute, compared to 1 second normally. Meanwhile, new WALs were
being generated by regular user traffic, and the standbys got 24 hours
behind, with the gap still growing, before we resynced them so that we
wouldn't have to replay the WALs at all. That did the trick. But it seemed
like this performance drop was drastic enough to be worth reporting.
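
(For reference, a query along these lines, using only the standard 9.6
recovery functions, is enough to watch the replay lag on a standby; the
exact columns are a matter of taste.)

-- On the standby: how far WAL replay is behind, by time and by location.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_time_lag,
       pg_last_xlog_receive_location()         AS last_received_lsn,
       pg_last_xlog_replay_location()          AS last_replayed_lsn;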

I was able to reproduce the slow recovery time of WALs generated by DROP
TABLE on a pair of test servers.

Normal writes (INSERT, UPDATE, DELETE, CREATE TABLE) generated WAL files
that were recoverable on the order of 1 second each. The standby never fell
behind, even when hundreds of WAL files were generated quickly. But when I
generated WALs exclusively from DROP TABLE, it took 20-80 seconds to
recover each one.

Some benchmarks:

For 10 minutes, I issued a bunch of rapid-fire INSERT statements to
server1, the primary. These generated 144 WAL files. At the end of the 10
minutes, server2, the standby, had already recovered 143 of them and
immediately completed the last one.

For 10 minutes, I issued a bunch of rapid-fire DROP TABLE statements to
server1, the primary. These generated 28 WAL files. At the end of the 10
minutes, server2, the standby, had only recovered 16 of them.
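
In case it's useful, the WAL volume for a test window can be measured with
something like the following (psql's \gset stashes the starting location;
the variable name is arbitrary):

-- On the primary, before the test run:
SELECT pg_current_xlog_location() AS start_lsn \gset

-- ... run the INSERT or DROP TABLE workload ...

-- After the test run: bytes of WAL generated, and the current WAL file name.
SELECT pg_xlog_location_diff(pg_current_xlog_location(), :'start_lsn') AS wal_bytes,
       pg_xlogfile_name(pg_current_xlog_location()) AS current_wal_file;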

My expectation is that replaying binary changes on a standby should be at
least as fast and sometimes faster than generating them locally. This is
because when a logical command like "DROP TABLE" or "INSERT" is issued on a
primary, the database has to check for dependencies, satisfy constraints,
etc. But once all those decisions are made on the primary, the binary
changes should be replayable much faster on the standby.

For example, server1, the primary in my test pair, has significantly better
hardware than server2, the standby. If server2 is keeping up with server1
in replication, I predict that server2 should be able to get through the
same amount of writes faster as a replica than as a standalone.

This expectation was borne out in my testing for INSERTs but not DROP
TABLEs, per the following benchmarks.

I issued 9,000,000 INSERTs to server1, the primary. It took 10 minutes to
complete these. At the end of the 10 minutes, server2, the standby, was
caught up. When I made server2 a standalone, it took 15 minutes to get
through the same 9,000,000 INSERTs. In conclusion, being a standalone was
50% slower for server2 when it came to INSERT.

I issued 100,000 DROP TABLEs to server1, the primary. It took 10 minutes to
complete these on server 1, and 20 minutes for server2, the standby, to
catch up. When I made server2 a standalone, it took 18 minutes to get
through the same 100,000 DROP TABLEs. In conclusion, being a standalone was
10% faster for server2 when it came to DROP TABLE.

It seems that there is something especially inefficient about WAL files
generated by DROP TABLE, or the process of recovering them.

I also want to mention that recovering a WAL file generated by DROP TABLE
commands consistently takes approximately 100% CPU on the standby, both in
testing and in prod. In prod, we also saw high memory use for that process.

The standby with 128 GB of RAM showed this:

  PID USER      PR  NI  VIRT  RES   SHR S  %CPU %MEM    TIME+  COMMAND
23279 postgres  20   0  57.0g  50g  5.1g R 100.0 40.0  1723:15  postgres:
startup process   recovering 000203B1008D

The standby with 32 GB of RAM crashed overnight with an out-of-memory error.

When I restarted the servers and let them resume recovering WALs, I watched
%MEM and RES grow slowly and steadily before I gave up and resynced. No
memory appeared to be freed up after each WAL file was replayed.

I was not able to reproduce memory usage getting quite this high in the
time I was allotted for testing, but I did witness steady climbing up to
almost 20%, especially when I mixed DROP TABLEs with regular writes. I
suspect there is a sweet (or bitter?) spot of write activity we hit in our
prod workload that I was unable to reproduce in testing.

I'm not sure if this is the right mailing list to report this behavior to,
but I thought it was the best place to start. Let me know if there's any
other information I should provide, or if I should send it to another list.

Thanks,
Sherrylyn


Re: Slow WAL recovery for DROP TABLE

2018-07-18 Thread Sherrylyn Branchaw
> There was a recent commit for a similar performance problem, which will
appear in 9.6.10.  But that was specifically for cases where there were
multiple dropped tables per transaction, and large shared_buffers.

Interesting, and good to know, thanks! I'm not sure we fall under either
(is 8 GB large? It's larger than the default, but I always thought large
was defined as "more than 8GB" for this setting), but it sounds like this
sort of problem is on the developers' radars. It's possible up to 4 tables
were dropped per transaction in prod, but I don't know if that's enough to
count as "multiple", and in testing, I reproduced the problem with 1 drop
per transaction.

> I can't reproduce your single-drop-per-transaction problem.  The replica
has no problem keeping up with the master.

It's possible that the problem only occurs when the replica is on inferior
hardware. I was unable to test equal servers in the time I had. I noticed
that when the superior server was the replica, it was able to keep up with
the inferior primary; dropping tables was the only action for which the
inferior server couldn't keep up as a standby, and the only action for
which the standalone outperformed the replica. I did not test truncates;
it's possible I would have seen the same problem there.

> Can you share the reproduction scripts

For the table drops, I prepped by running these:

CREATE TABLE IF NOT EXISTS test1 (id int);
CREATE TABLE IF NOT EXISTS test2 (id int);
CREATE TABLE IF NOT EXISTS test3 (id int);
...
CREATE TABLE IF NOT EXISTS test10 (id int);

Then I dropped with these:

DROP TABLE IF EXISTS test1;
DROP TABLE IF EXISTS test2;
DROP TABLE IF EXISTS test3;
...
DROP TABLE IF EXISTS test10;

For the inserts, I did a

CREATE TABLE test (content1 TEXT, content2 TEXT, content3 TEXT);

followed by 300 of these statements, which insert a random string into each
of the three columns, in batches of 1 rows:

INSERT INTO test
SELECT array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer)
                             FROM generate_series(1,135)), ''),
       array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer)
                             FROM generate_series(1,135)), ''),
       array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer)
                             FROM generate_series(1,135)), '')
FROM (SELECT generate_series(1,1)) foo;

I loaded and timed each script as a file using

time psql -d mydb -f drop_tables.sql
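
(To reproduce at larger scale without hand-writing the CREATE and DROP
statements, the same scripts could be generated in bulk; the table count
below is made up, and each generated statement still runs in its own
transaction if the output is written to a file and fed to psql, or executed
with \gexec.)

-- Generate the CREATE script (the 100000 is arbitrary):
SELECT format('CREATE TABLE IF NOT EXISTS test%s (id int);', g)
FROM generate_series(1, 100000) AS g;

-- Generate the DROP script:
SELECT format('DROP TABLE IF EXISTS test%s;', g)
FROM generate_series(1, 100000) AS g;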

> any non-default config settings?  Especially the setting of
shared_buffers (on both master and replica, if different)

Our shared_buffers setting is 8 GB on all nodes.

In prod, we had a primary and two standbys. One standby was very similar to
the primary in hardware, but older and with fewer CPUs. The other standby
was far inferior in hardware. For testing, I used the two standbys, and was
able to reproduce when I made the superior standby the primary.

These are the non-default settings on the primary and the standby of
comparable hardware (other than changing file and directory paths, which
shouldn't affect anything):

listen_addresses = '*'
log_destination = 'stderr, syslog'
log_filename = 'postgresql-%Y-%m-%d.log'
log_line_prefix = '%p [%m]: %u %d %q %h %a %v %x'
syslog_facility = 'local1'
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_timezone = 'US/Central'
log_statement = 'ddl'
track_functions = pl
track_activity_query_size = 4096
timezone = US/Central
client_encoding = UTF8
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
max_connections = 1024
tcp_keepalives_idle = 7200
tcp_keepalives_interval = 75
tcp_keepalives_count = 9
work_mem = '48 MB'
maintenance_work_mem = '1 GB'
max_locks_per_transaction = 8192
random_page_cost = 2.0
effective_cache_size = '94GB'
log_autovacuum_min_duration = 10s
autovacuum_naptime = 2min
autovacuum_vacuum_threshold = 500
autovacuum_analyze_threshold = 250
autovacuum_freeze_max_age = 10
autovacuum_vacuum_cost_delay = 10ms
autovacuum_vacuum_scale_factor = 0.01
autovacuum_analyze_scale_factor = 0.01
wal_buffers = 16MB
bgwriter_lru_maxpages = 250
max_wal_senders = 5
wal_keep_segments = 256
hot_standby = on
log_min_duration_statement = 2s
wal_level = replica
wal_log_hints = on
archive_mode = on
archive_command = 'ssh postgres@backupserver "test ! -f /opt/backup/db1/%f"
&& rsync -q %p postgres@backupserver:/opt/backup/db1/%f'
max_parallel_workers_per_gather = 4
max_worker_processes = 16

The only different setting on the second standby is effective_cache_size,
which is 24 GB instead of 94 GB.

To rule out fetching from the remote archive as the bottleneck, I tried
scp-ing a bunch of WAL files to the standby and resetting the restore
command to replay from the local archive. Same performance problem, and
only when dropping tables.

Best,
Sherrylyn


Re: Slow WAL recovery for DROP TABLE

2018-07-18 Thread Sherrylyn Branchaw
> Hi, I have also reported a similar problem in the hackers mailing list,
but particularly on TRUNCATE TABLE.
https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C62FD6E6%40g01jpexmbkw24

Ooh, interesting. I admit I did not include TRUNCATE in my testing.

> The problem lies with the standby server’s replay as it does separate
scanning of the whole shared buffer for each DROP/TRUNCATE TABLE in order
to check if the table-to-delete is cached in shared buffer. Therefore, it
will take a long recovery time and sometimes fail for large tables
depending on shared_buffer size.

Also very interesting. We only (?) have 8 GB of shared buffers (though even
that is roughly a million 8 kB buffers to scan per dropped relation), and I
see from your message that you had 300 GB. All of our tables, both in prod
and in my reproduction, were empty, but there were hundreds of thousands of
them.

> The main problem here is the scanning of shared_buffers, which not only
affects drop/truncate table, but also drop database and vacuum as well.

I wondered about that. I didn't have any problem with a single drop
database, but the database dropped was a small one (albeit one with a few
hundred thousand empty tables), and I neither tested dropping a large
database nor dropping 100,000 databases. I didn't test vacuuming, but we do
heavy vacuuming on all our primaries frequently, and...hmm. Regular
vacuuming doesn't cause any problems that have made it onto my radar, but
VACUUM FULLs can cause WAL files to pile up on the primary's pg_xlog before
getting archived. I never investigated that, just throttled my VACUUM
FULLs, because they're only ever run manually. I will keep an eye on the
recovery time of individual files the next time I have to do this, which
will probably be soon.
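
When I do, a rough check like the following (just pg_stat_archiver against
the current WAL position, not a real alerting query) should be enough to
watch the archiving backlog:

-- 9.6 function names; run on the primary.
SELECT a.last_archived_wal,
       a.last_archived_time,
       a.failed_count,
       pg_xlogfile_name(pg_current_xlog_location()) AS current_wal_file
FROM pg_stat_archiver a;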

> But I think any working minor solutions/fixes from developers are also
welcome, such as the recent committed patch for the multiple dropped tables
per transaction with large shared_buffers.

Agreed. Should I have sent or should I still send this to pgsql-hackers? I
wasn't sure, so I erred on the side of not bothering the developers until
I'd gotten some feedback here.

Best,
Sherrylyn


pg_restore restores privileges differently from psql

2019-01-14 Thread Sherrylyn Branchaw
Hi,

I'm running two Postgres 9.6.11 databases on RHEL 6.9.

I'm restoring the schema from one database (prod) to another (dev). There
are users that exist in prod that don't exist in dev. When the restore job
tries to grant privileges to nonexistent users in dev, I would like it to
generate an error, which is safe to ignore, but still correctly grant
privileges to any user that does exist in dev.

That's the behavior I see when I dump to a plain file and restore it using
*psql -f*, but not the behavior I see when I do a *pg_dump -Fc* followed by
*pg_restore*. *pg_restore* seems to treat all the *GRANT* statements for a
single object as a single statement, and when one errors out, they all
error out, meaning I'm left with no privileges on the object in question.

For instance, when this appears in my plaintext file:
GRANT ALL ON SCHEMA test TO user1;
GRANT USAGE ON SCHEMA test TO user2;
GRANT USAGE ON SCHEMA test TO user3;

and user1 doesn't exist on the target database, user2 and user3 get the
expected privileges when restoring from *psql* but not from *pg_restore*.

Here's a reproducible test case.

CREATE DATABASE prod_db;
CREATE DATABASE dev_db_pg_restore;
CREATE DATABASE dev_db_psql;
CREATE ROLE prod_user;
CREATE ROLE prod_and_dev_user;

-- in prod_db
CREATE SCHEMA test;
GRANT ALL ON SCHEMA test TO prod_user;
GRANT USAGE ON SCHEMA test TO prod_and_dev_user;

pg_dump -Fc prod_db &> prod_dump.bin
pg_dump prod_db &> prod_dump.sql

-- On database instance containing the dev dbs.
DROP ROLE prod_user;

pg_restore prod_dump.bin -d dev_db_pg_restore
psql -d dev_db_psql -f prod_dump.sql

-- In dev_db_psql
SELECT has_schema_privilege('prod_and_dev_user', 'test', 'usage');
Expected result: true
Actual result: true

-- In dev_db_pg_restore
SELECT has_schema_privilege('prod_and_dev_user', 'test', 'usage');
Expected result: true
Actual result: false

The behavior seems to be related to the fact that *pg_restore* reports the
failed command as containing all of the semicolon-delimited privilege
commands, which get executed separately when restoring from plaintext:

pg_restore: [archiver (db)] could not execute query: ERROR:  role
"prod_user" does not exist
Command was: GRANT ALL ON SCHEMA test TO prod_user;
GRANT USAGE ON SCHEMA test TO prod_and_dev_user;

As a workaround, I've created an unprivileged dummy user by this name on the
dev database, but my question is, is this a bug or feature? If a feature,
is the behavior documented? I didn't find any documentation, but that
doesn't mean it doesn't exist.

Thanks,
Sherrylyn


Re: pg_restore restores privileges differently from psql

2019-01-14 Thread Sherrylyn Branchaw
> The above needs more information:
>
> 1) Are the dev_* databases on a different cluster?
>
> 2) If so did you run:
>
> CREATE ROLE prod_user;
> CREATE ROLE prod_and_dev_user;
>
> on that cluster first?

I happened to put them all on the same cluster for my test case, in order
to reproduce the unexpected behavior I encountered in the wild, where the
prod and dev dbs happened to live on different clusters. In short, as long
as you make sure the *prod_user* exists on the source cluster at the time
when the dump is taken, and doesn't exist on the target cluster when the
restore is carried out, you get the behavior I saw.

> Also if so:
>
> In the restores below are you sure you are pointed at the same cluster
> in each case?

Yes, I am sure. Both for the test case I was creating for the mailing list,
and for the script where I first encountered this in the wild. Worked like
a charm when I used *psql*, didn't do what I expected when I used
*pg_restore*.

> What do you see if you do:
>
> pg_restore -f prod_dump_restore.sql prod_dump.bin
>
> and look in prod_dump_restore.sql?

This is exactly what I did when I was first trying to figure out what was
going on. I see

GRANT USAGE ON SCHEMA test TO prod_and_dev_user;
GRANT ALL ON SCHEMA test TO prod_user;

If I then use *psql* to load *prod_dump_restore.sql* to a cluster that
doesn't have the *prod_user* role, I get the expected behavior
(*prod_and_dev_user* has usage on the schema *test*), because *psql* treats
each of those statements as a separate command. *pg_restore* seems to treat
them as a single command, judging by the error message and the behavior.

Best,
Sherrylyn


Re: pg_restore restores privileges differently from psql

2019-01-14 Thread Sherrylyn Branchaw
> I don't see how that can work:
>
> test=# \c prod_db
> You are now connected to database "prod_db" as user "postgres".
> prod_db=# CREATE SCHEMA test;
> CREATE SCHEMA
> prod_db=# GRANT ALL ON SCHEMA test TO prod_user;
> GRANT
> prod_db=# GRANT USAGE ON SCHEMA test TO prod_and_dev_user;
> GRANT
> prod_db=# \c dev_db_psql
> You are now connected to database "dev_db_psql" as user "postgres".
> dev_db_psql=# DROP ROLE prod_user;
> ERROR:  role "prod_user" cannot be dropped because some objects depend
> on it
> DETAIL:  1 object in database prod_db

Yes, if you're going to put all your databases on the same cluster, you
first have to remove dependent objects before dropping the role. There are
multiple ways of going about that: dropping the database, revoking the
privileges on the objects in question, etc. If you put the databases on
different clusters and make sure you create only the prod_and_dev_user on
the second cluster, you won't run into this issue.

The goal is to make sure the prod_user role exists when the dump is taken
and doesn't exist when the restore is done. You can do this by putting the
databases on separate clusters and creating the appropriate roles, or by
dropping the user on the single cluster.

Sorry, I considered spelling all this out in the original post, because
there are two different ways of going about making sure the user isn't
present for the restore, but it seemed unnecessarily complicated, and I
thought I would let people decide what makes sense in their own environment
for testing. Sorry if that led to more confusion in the end. Just make sure
the user exists when you need it to exist and doesn't exist when you need
it not to exist, and test both the pg_restore and psql methods, and I
expect you'll see the same behavior I did (and if not, I'll be very
curious).

Best,
Sherrylyn


Re: pg_restore restores privileges differently from psql

2019-01-15 Thread Sherrylyn Branchaw
> Yeah, this is a known issue --- the various GRANTs for a specific object
> are stored in a single "TOC entry" in the archive, which pg_restore will
> send to the server in a single PQexec call, causing them to be effectively
> one transaction.  The easiest way to deal with it is to not send
> pg_restore's output directly to the target server, but feed it through
> psql, something like
>
> pg_restore  ... | psql [connection parameters]
>
> There's been some discussion of a real fix, but it seems messy.
> pg_restore doesn't have a parser that would be adequate to separate
> out multiple SQL commands in a TOC entry, and we'd rather not try
> to give it one (mainly because of fear of cross-version compatibility
> issues).

Okay, thank you. I thought that might be the case: undesirable behavior
where the costs outweigh the benefits of fixing. Given that, would it be
worth making it more obvious in the pg_restore documentation that
pg_restore and its psql output don't always give the same results in the
target database?

Best,
Sherrylyn


Re: pg_upgrade --jobs

2019-04-07 Thread Sherrylyn Branchaw
> are there any shortcuts to upgrading that would circumvent exporting the
> entire schema?

By "shortcuts," do you mean you want to minimize the time and energy you
put into the upgrade, or that you want to minimize database downtime? If
you mean downtime, I was able to upgrade a customer-facing database with
~350,000 tables from Postgres 9.0 to 9.6 last year with only 86 seconds of
downtime, using Slony, but I had to make many custom modifications to Slony
and test thoroughly beforehand, and it was not for the faint of heart, the
pressed for time, or the inexperienced. There may be better ways (and if
so, I would be curious to learn about them), but Slony was the tool with
which I was most familiar at the time.

This method does, of course, require exporting the entire schema, but
because our only constraint was to minimize customer downtime, and the
database was online while the schema was being exported, we didn't care how
long it took. Your constraints may be different.

For those reading: we do know that 350,000 tables is Doing It Wrong, and
we're getting rid of them, but we decided being on an EOLed version of
Postgres was worse and should be fixed first.

Sherrylyn


Re: pg_upgrade --jobs

2019-04-07 Thread Sherrylyn Branchaw
> It may take a while for slony replication to be in sync, but when it is,
there will be very little down time to switch over.

I agree in principle, which is why I chose Slony over pg_upgrade for my
company's very similar situation, but my experience was that, out of the
box, Slony was projected to take unacceptably long (months) to sync our
350,000 tables, and downtime was going to be many hours. In order to get
those numbers down, I had to reduce the number of per-table steps Slony was
executing, e.g. by eliminating unnecessary-for-us ones and by rewriting
others to happen in bulk.

Here's something I didn't know Slony did when I didn't have 350,000 tables
to worry about: add a table to replication, run through the existing list
of replicated tables to make sure it's captured any changes that have
happened in the meantime, add one more table to replication, run through
the existing list of tables to make sure no changes have happened, and so
on. The more tables you add, the longer it takes to add the next table.
Here's another thing I didn't know it did: during the switchover, manage 4
triggers per table serially on primary and standby. 4 * 350,000 * 2 = 2.8
million triggers. (I knew it managed 4 triggers per table, but not that it
was serial or how long that would take when pg_trigger had almost 3 million
relevant rows.)

I would love to help the OP out in a more hands-on way (I have
upgrade-via-Slony consulting experience), as well as to open source the
custom changes I came up with, but I'm debating whether I have the time to
offer to take on another project right now. I'm also reluctant to summarize
exactly what I did, because messing with pg_catalog directly is very
delicate and likely to go wrong, and I don't recommend it to the
inexperienced at all, or to the experienced if they have alternatives.

> Plan B is to drop a lot of tables and deal with imports later.

If it were me, I would take a copy of my database, restore it to a sandbox
environment, set up Slony, and get an estimate for the projected sync time.
Let it run for a few hours to see how dramatically the sync time slows down
as the number of tables grows. I.e. don't count the number of replicated
tables after an hour and assume that's your hourly rate of syncing. If it
looks like you can sync your entire database in a few days or less, then
let it run and test out your failover time. If that's acceptable, you're
good to go.

If sync time looks like it's going to be incredibly long on your schema +
hardware + Postgres version, etc., then failover time probably would be
too. In that case, temporarily dropping the tables you can drop may be
preferable to the complexity of making changes to speed up Slony, if you
don't have a seriously experienced DBA on hand.

Sherrylyn


Does pg_stat_get_live_tuples() matter?

2019-04-09 Thread Sherrylyn Branchaw
If a table has 1.8M rows in reality, yet pg_stat_get_live_tuples() returns
1.8K, does it matter to the performance of the database, as long as
pg_class.reltuples is the right order of magnitude?

The query planner seems to use the accurate estimate, but what about the
autovacuum daemon? Or anything else?

In short, is pg_stat_get_live_tuples() (and the views that invoke it) used
by any built-in tooling, or is it just there for the user to do monitoring?
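
For context, the discrepancy I'm describing is what a query along these
lines shows (the table name is made up):

SELECT c.reltuples::bigint                 AS reltuples,
       pg_stat_get_live_tuples(c.oid)      AS n_live_tup,
       (SELECT count(*) FROM my_big_table) AS actual_rows
FROM pg_class c
WHERE c.relname = 'my_big_table';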

Thanks,
Sherrylyn


Re: Does pg_stat_get_live_tuples() matter?

2019-04-09 Thread Sherrylyn Branchaw
> What version of Postgres?

Sorry, can't believe I forgot to include that! Postgres 9.6.9.

> How are you getting at the 1.8M number?

SELECT COUNT(*)


Re: Does pg_stat_get_live_tuples() matter?

2019-04-10 Thread Sherrylyn Branchaw
> Hmmm ... what was in reltuples again?

Reltuples had the correct number, and the query plans were showing the
correct estimates.

> it's a bit hard to believe that it could get to be off by 1000X.  Have
you suppressed autovacuum on this table?

No, but here are some things I've observed:

1)  Any time pg_stat_reset() gets run, pg_stat_user_tables.n_live_tup gets
reset to 0. pg_class.reltuples is untouched.

2) If new tuples get inserted or deleted after pg_stat_reset(),
pg_stat_user_tables.n_live_tup will match pg_stat_user_tables.n_tup_ins -
pg_stat_user_tables.n_tup_del up until the next analyze or autoanalyze.

3) Once the next (auto)analyze happens, pg_stat_user_tables.n_live_tup will
be updated to match pg_class.reltuples.

4) If a table is very large, it may be a while before
autovacuum_analyze_scale_factor is reached. We have ours set to 0.01, which
works out to 18,000 rows on a 1.8M-row table, an order of magnitude larger
than 1.8K. (I would like to tune our settings more, but I have a list of
priorities from management on which this isn't high.)

5) Given 1-4, pg_stat_user_tables.n_live_tup may spend quite a long time
matching pg_stat_user_tables.n_tup_ins - pg_stat_user_tables.n_tup_del
instead of pg_class.reltuples (see the query sketch after this list). For
example, if a table has 1.8 million rows, and you insert 5 and delete 4
after a stats reset, n_live_tup will report that the table has 1 tuple.

6)  Failovers, at least in Aurora, apparently cause pg_stat_reset() to be
run, at least judging by the timestamp I'm seeing in
pg_stat_bgwriter.stat_reset. We haven't done a failover in the data center
in a while, and it's less trivial for me to test there atm, so I'm not
certain whether open-source Postgres failovers also reset statistics.
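
The query sketch mentioned in item 5, for watching that drift (the ordering
and LIMIT are arbitrary):

-- After a stats reset, n_live_tup tracks n_tup_ins - n_tup_del until the
-- next (auto)analyze pulls it back in line with reltuples.
SELECT relname,
       n_live_tup,
       n_tup_ins - n_tup_del AS ins_minus_del,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC
LIMIT 20;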

> I don't see anything in the current core code that pays attention to
n_live_tuples.  reltuples definitely does matter to the planner, and some
of the sibling counters like n_dead_tuples drive autovacuum, but nothing is
examining n_live_tuples AFAICS.

That's what I thought, but I wanted to make sure I wasn't missing anything
obvious. Thanks!

> some of the sibling counters like n_dead_tuples drive autovacuum

So that's interesting. I knew that, but hadn't thought about the
implications. If pg_stat_reset() is executed by failovers, and a failover
happens just before a table is ready to be vacuumed--say it has 0.009 *
reltuples dead tuples (I'm simplifying the formula here)--then n_dead_tup
gets reset to 0 and the counting starts all over again. Regular failovers
could thus increase bloat by delaying the autovacuum daemon from
recognizing that a table needs to be vacuumed, am I right?
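
(If so, the best sanity check I can think of after a failover is something
like the following, just cross-referencing the stats-reset time with the
per-table dead-tuple counters; it's a sketch, not a real threshold
calculation.)

SELECT d.stats_reset,
       t.relname,
       t.n_dead_tup,
       t.last_autovacuum,
       t.last_autoanalyze
FROM pg_stat_database d
CROSS JOIN pg_stat_user_tables t
WHERE d.datname = current_database()
ORDER BY t.n_dead_tup DESC
LIMIT 20;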

Is it recommended practice to manually VACUUM ANALYZE the whole database
after a failover? Or is resetting stats after a failover just an Aurora
thing? I'm sorry I'm asking the latter question instead of testing, but
I've been ordered not to spend time on improving our vacuuming and
statistics until 5 other large projects are done, and I'm spending a
minimal amount of time anyway just to see how badly frequent failovers
might be affecting us and if there's any action we need to take.

Thanks,
Sherrylyn