What to do when dynamic shared memory control segment is corrupt
Greetings,

We are using Postgres 9.6.8 (planning to upgrade to 9.6.9 soon) on RHEL 6.9. We recently experienced two similar outages on two different prod databases. The error messages from the logs were as follows:

LOG: server process (PID 138529) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

We are still investigating what may have triggered these errors, since there were no recent changes to these databases. Unfortunately, core dumps were not configured correctly, so we may have to wait for the next outage before we can do a good root cause analysis. My question, meanwhile, is about remedial actions to take when this happens.

In one case, the logs recorded:

LOG: all server processes terminated; reinitializing
LOG: incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 365/FDFA738
LOG: invalid record length at 365/12420978: wanted 24, got 0
LOG: redo done at 365/12420950
LOG: last completed transaction was at log time 2018-06-05 10:59:27.049443-05
LOG: checkpoint starting: end-of-recovery immediate
LOG: checkpoint complete: wrote 5343 buffers (0.5%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=0.131 s, sync=0.009 s, total=0.164 s; sync files=142, longest=0.005 s, average=0.000 s; distance=39064 kB, estimate=39064 kB
LOG: MultiXact member wraparound protections are now enabled
LOG: autovacuum launcher started
LOG: database system is ready to accept connections

In that case, the database restarted immediately, with only 30 seconds of downtime.

In the other case, the logs recorded:

LOG: all server processes terminated; reinitializing
LOG: dynamic shared memory control segment is corrupt
LOG: incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7

In that case, the database did not restart on its own. It was 5 am on Sunday, so the on-call SRE just started the database up manually, and it appears to have been running fine since.

My question is whether the corrupt dynamic shared memory control segment, and the failure of Postgres to restart automatically, mean the database should not be started back up automatically, and whether there is something we should be doing before restarting. Do we potentially have corrupt data or indices as a result of our last outage? If so, what should we do to investigate?

Thanks,
Sherrylyn
Re: What to do when dynamic shared memory control segment is corrupt
> Hm ... were these installations built with --enable-cassert? If not,
> an abort trap seems pretty odd.

The packages are installed directly from the yum repos for RHEL. I'm not aware that --enable-cassert is being used, and we're certainly not installing from source.

> Those "incomplete data" messages are quite unexpected and disturbing.
> I don't know of any mechanism within Postgres proper that would result
> in corruption of the postmaster.pid file that way. (I wondered briefly
> if trying to start a conflicting postmaster would result in such a
> situation, but experimentation here says not.) I'm suspicious that
> this may indicate a bug or unwarranted assumption in whatever scripts
> you use to start/stop the postmaster. Whether that is at all related
> to your crash issue is hard to say, but it bears looking into.

We're using the stock init.d script from the yum repo, but I dug into this issue, and it looks like we're passing the path to the postmaster.pid as the $pidfile variable in our sysconfig file, meaning the init.d script is managing the postmaster.pid file, and specifically is overwriting it with a single line containing just the pid. I'm not sure why it's set up like this, and I'm thinking we should change it, but it seems harmless and unrelated to the crash. In particular, manual init.d actions such as stop, start, restart, and status all work fine.

> No, that looks like fairly typical crash recovery to me: corrupt shared
> memory contents are expected and recovered from after a crash.

That's reassuring. But if it's safe for us to immediately start the server back up, why did Postgres not automatically start the server up like it did the first time? I was assuming it was due to the presence of the corrupt memory segment, as that was the only difference in the logs, although I could be wrong. Automatic restart would have saved us a great deal of downtime: in the first case we had total recovery within 30 seconds, while in the second case we had many minutes of downtime while someone got paged, troubleshot the issue, and eventually decided to try starting the database back up. At any rate, if it's safe, we can write a script to detect this failure mode and restart automatically, although it would be less error-prone if Postgres restarted automatically.

> Hm, I supposed that Sherrylyn would've noticed any PANIC entries in
> the log.

No PANICs. The log lines I pasted were the only ones that looked relevant in the Postgres logs. I can try to dig through the application logs, but I was planning to wait until the next time this happens, since we should have core dumps fixed by then, and that might make things easier.

> What extensions are installed, if any?

In the first database, the one without the corrupt memory segment and that restarted automatically: plpgsql and postgres_fdw. In the second database, the one where the memory segment got corrupted and that didn't restart automatically: dblink, hstore, pg_trgm, pgstattuple, plpgsql, and tablefunc.

I forgot to mention that the queries that got killed were innocuous-looking SELECTs that completed successfully for me in less than a second when I ran them manually. In other words, the problem was not reproducible.

Sherrylyn
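P.S. For concreteness, the sysconfig setup in question looks roughly like this; the paths are illustrative rather than our exact ones, and the key point is only the last line:

# /etc/sysconfig/pgsql/postgresql-9.6
PGDATA=/var/lib/pgsql/9.6/data
PGPORT=5432
# Problematic: pointing the init script's pidfile at PostgreSQL's own
# postmaster.pid lets the script rewrite that file with a single PID line,
# which is presumably where the "incomplete data" messages come from.
pidfile=/var/lib/pgsql/9.6/data/postmaster.pid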
Re: What to do when dynamic shared memory control segment is corrupt
> Yeah, I'd like to know that too. The complaint about corrupt shared memory
> may be just an unrelated red herring, or it might be a separate effect of
> whatever the primary failure was ... but I think it was likely not the
> direct cause of the failure-to-restart. Anyway, I would not be afraid to
> try restarting the postmaster manually if it died. Maybe don't do that
> repeatedly without human intervention; but PG is pretty robust against
> crashes. We developers crash it all the time, and we don't lose data.

Understood, and thanks. I was basing my concern on a message in the mailing lists that suggested that Postgres might fail to start up in the event of a corrupted memory segment. I would link to the message directly, but I keep getting backend server error messages when I try to search for it today. At any rate, it looked like there was a chance that it was a deliberate design choice, and I didn't want to ignore it if so. It's good to know that this is not the case.

> I realize that you're most focused on less-downtime, but from my
> perspective it'd be good to worry about collecting evidence as to what
> happened exactly.

Absolutely. I would love to know why this is happening too. However, our priorities have been set in part by a very tight deadline handed down from the C-levels to migrate to Aurora, so we have to focus our energies accordingly. I will be back with core files if this happens again before we're completely migrated over. Meanwhile, thank you for assuring me that we have no current data corruption and that it's safe to restart next time without taking additional action to avoid or detect corruption.

Best,
Sherrylyn
Slow WAL recovery for DROP TABLE
We are running Postgres 9.6.9 on RHEL 6.9. We're using built-in streaming replication, with a WAL archive for fallback purposes. No logical replication.

We recently had a bug in our code accidentally create several hundred thousand tables in a single database. A batch job started cleaning them up one at a time (in separate transactions) over the course of a couple of nights. What we noticed the next day was that the two standbys for this database were falling further and further behind. They were no longer streaming; instead they were recovering from the archive, and they were doing so very slowly. Only a small amount of WAL was generated by the DROP TABLEs, relative to normal traffic, but recovering a single WAL file sometimes took over a minute, compared to about 1 second normally. Meanwhile, new WAL was being generated by regular user traffic, and the standbys were 24 hours behind, and falling further behind, before we resynced them so that we would not have to replay the WAL. That did the trick. But the performance drop seemed drastic enough to be worth reporting.

I was able to reproduce the slow recovery of WALs generated by DROP TABLE on a pair of test servers. Normal writes (INSERT, UPDATE, DELETE, CREATE TABLE) generated WAL files that were recoverable on the order of 1 second each. The standby never fell behind, even when hundreds of WAL files were generated quickly. But when I generated WALs exclusively from DROP TABLE, it took 20-80 seconds to recover each one.

Some benchmarks:

For 10 minutes, I issued a bunch of rapid-fire INSERT statements to server1, the primary. These generated 144 WAL files. At the end of the 10 minutes, server2, the standby, had already recovered 143 of them and immediately completed the last one.

For 10 minutes, I issued a bunch of rapid-fire DROP TABLE statements to server1, the primary. These generated 28 WAL files. At the end of the 10 minutes, server2, the standby, had recovered only 16 of them.

My expectation is that replaying binary changes on a standby should be at least as fast as, and sometimes faster than, generating them locally. When a logical command like DROP TABLE or INSERT is issued on a primary, the database has to check for dependencies, satisfy constraints, etc.; once all those decisions have been made on the primary, the binary changes should be replayable much faster on the standby. For example, server1, the primary in my test pair, has significantly better hardware than server2, the standby. If server2 is keeping up with server1 in replication, I predict that server2 should be able to get through the same amount of writes faster as a replica than as a standalone. This expectation was borne out in my testing for INSERTs but not for DROP TABLEs, per the following benchmarks.

I issued 9,000,000 INSERTs to server1, the primary. It took 10 minutes to complete these. At the end of the 10 minutes, server2, the standby, was caught up. When I made server2 a standalone, it took 15 minutes to get through the same 9,000,000 INSERTs. In conclusion, being a standalone was 50% slower for server2 when it came to INSERT.

I issued 100,000 DROP TABLEs to server1, the primary. It took 10 minutes to complete these on server1, and 20 minutes for server2, the standby, to catch up. When I made server2 a standalone, it took 18 minutes to get through the same 100,000 DROP TABLEs. In conclusion, being a standalone was 10% faster for server2 when it came to DROP TABLE.
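For what it's worth, the "caught up" and "recovered N of them" numbers above came from watching replay positions; on 9.6 that kind of check looks something like the following (a typical approach, not necessarily the exact queries I ran):

-- On the standby: how far replay lags, in time and WAL position
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay,
       pg_last_xlog_receive_location()         AS last_received,
       pg_last_xlog_replay_location()          AS last_replayed;

-- On the primary: what each streaming standby has been sent and has replayed
SELECT application_name, state, sent_location, replay_location
FROM pg_stat_replication;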
It seems that there is something especially inefficient about WAL files generated by DROP TABLE, or about the process of recovering them.

I also want to mention that recovering a WAL file generated by DROP TABLE commands consistently takes approximately 100% CPU on the standby, both in testing and in prod. In prod, we also saw high memory use for that process. The standby with 128 GB of RAM showed this:

  PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
23279 postgres 20   0 57.0g  50g 5.1g R 100.0 40.0  1723:15  postgres: startup process recovering 000203B1008D

The standby with 32 GB of RAM crashed overnight with an out-of-memory error. When I restarted the servers and let them resume recovering WALs, I watched %MEM and RES grow slowly and steadily before I gave up and resynced. No memory appeared to be freed after each WAL file was replayed. I was not able to reproduce memory usage getting quite this high in the time I was allotted for testing, but I did see a steady climb to almost 20%, especially when I mixed DROP TABLEs with regular writes. I suspect there is a sweet (or bitter?) spot of write activity that we hit in our prod workload and that I was unable to reproduce in testing.

I'm not sure if this is the right mailing list to report this behavior to, but I thought it was the best place to start. Let me know if there's any other information I should provide, or if I should send it to another list.

Thanks,
Sherrylyn
Re: Slow WAL recovery for DROP TABLE
> There was a recent commit for a similar performance problem, which will
> appear in 9.6.10. But that was specifically for cases where there were
> multiple dropped tables per transaction, and large shared_buffers.

Interesting, and good to know, thanks! I'm not sure we fall under either (is 8 GB large? It's larger than the default, but I always thought "large" was defined as "more than 8 GB" for this setting), but it sounds like this sort of problem is on the developers' radar. It's possible up to 4 tables were dropped per transaction in prod, but I don't know if that's enough to count as "multiple", and in testing, I reproduced the problem with 1 drop per transaction.

> I can't reproduce your single-drop-per-transaction problem. The replica
> has no problem keeping up with the master.

It's possible that the problem only occurs when the replica is on inferior hardware. I was unable to test equal servers in the time I had. I noticed that when the superior server was the replica, it was able to keep up with the inferior primary; dropping tables was the only action the inferior server couldn't keep up with as a standby, and the only action for which the standalone outperformed the replica. I did not test truncates; it's possible I would have seen the same problem there.

> Can you share the reproduction scripts

For the table drops, I prepped by running these:

CREATE TABLE IF NOT EXISTS test1 (id int);
CREATE TABLE IF NOT EXISTS test2 (id int);
CREATE TABLE IF NOT EXISTS test3 (id int);
...
CREATE TABLE IF NOT EXISTS test10 (id int);

Then I dropped with these:

DROP TABLE IF EXISTS test1;
DROP TABLE IF EXISTS test2;
DROP TABLE IF EXISTS test3;
...
DROP TABLE IF EXISTS test10;

For the inserts, I did a

CREATE TABLE test (content1 TEXT, content2 TEXT, content3 TEXT);

followed by 300 of these statements, each of which inserts a random string into each of the three columns, in batches of 1 row:

INSERT INTO test
SELECT array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,135)), ''),
       array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,135)), ''),
       array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,135)), '')
FROM (SELECT generate_series(1,1)) foo;

I loaded and timed each script as a file using

time psql -d mydb -f drop_tables.sql

> any non-default config settings? Especially the setting of shared_buffers
> (on both master and replica, if different)

Our shared_buffers setting is 8 GB on all nodes. In prod, we had a primary and two standbys. One standby was very similar to the primary in hardware, but older and with fewer CPUs. The other standby was far inferior in hardware. For testing, I used the two standbys, and was able to reproduce the problem when I made the superior standby the primary.
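As a footnote to the scripts above: the same kind of workload can also be generated without writing the statements out by hand, e.g. with generate_series and psql's \gexec (a sketch only; the table names and counts here are illustrative):

-- Prep: create a batch of empty tables.
SELECT format('CREATE TABLE IF NOT EXISTS test%s (id int)', i)
FROM generate_series(1, 10000) AS i
\gexec

-- Drop them again. With autocommit on, \gexec runs each generated DROP as
-- its own statement, i.e. one drop per transaction, as in our incident.
SELECT format('DROP TABLE IF EXISTS test%s', i)
FROM generate_series(1, 10000) AS i
\gexec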
These are the non-default settings on the primary and on the standby of comparable hardware (other than changed file and directory paths, which shouldn't affect anything):

listen_addresses = '*'
log_destination = 'stderr, syslog'
log_filename = 'postgresql-%Y-%m-%d.log'
log_line_prefix = '%p [%m]: %u %d %q %h %a %v %x'
syslog_facility = 'local1'
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_timezone = 'US/Central'
log_statement = 'ddl'
track_functions = pl
track_activity_query_size = 4096
timezone = US/Central
client_encoding = UTF8
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
max_connections = 1024
tcp_keepalives_idle = 7200
tcp_keepalives_interval = 75
tcp_keepalives_count = 9
work_mem = '48 MB'
maintenance_work_mem = '1 GB'
max_locks_per_transaction = 8192
random_page_cost = 2.0
effective_cache_size = '94GB'
log_autovacuum_min_duration = 10s
autovacuum_naptime = 2min
autovacuum_vacuum_threshold = 500
autovacuum_analyze_threshold = 250
autovacuum_freeze_max_age = 10
autovacuum_vacuum_cost_delay = 10ms
autovacuum_vacuum_scale_factor = 0.01
autovacuum_analyze_scale_factor = 0.01
wal_buffers = 16MB
bgwriter_lru_maxpages = 250
max_wal_senders = 5
wal_keep_segments = 256
hot_standby = on
log_min_duration_statement = 2s
wal_level = replica
wal_log_hints = on
archive_mode = on
archive_command = 'ssh postgres@backupserver "test ! -f /opt/backup/db1/%f" && rsync -q %p postgres@backupserver:/opt/backup/db1/%f'
max_parallel_workers_per_gather = 4
max_worker_processes = 16

The only different setting on the second standby is effective_cache_size, which is 24 GB instead of 94 GB.

To rule out fetching from the remote archive as the bottleneck, I tried scp-ing a bunch of WAL files to the standby and resetting the restore command to replay from the local archive. Same performance problem, and only when dropping tables.

Best,
Sherrylyn
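P.S. For that local-archive test, the standby's recovery.conf pointed at the copied files with something along these lines (the path is illustrative):

standby_mode = 'on'
# replay from WAL files scp-ed onto the standby instead of the remote archive
restore_command = 'cp /opt/local_wal_archive/%f %p'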
Re: Slow WAL recovery for DROP TABLE
> Hi, I have also reported a similar problem in the hackers mailing list, but
> particularly on TRUNCATE TABLE.
> https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C62FD6E6%40g01jpexmbkw24

Ooh, interesting. I admit I did not include TRUNCATE in my testing.

> The problem lies with the standby server’s replay as it does separate
> scanning of the whole shared buffer for each DROP/TRUNCATE TABLE in order
> to check if the table-to-delete is cached in shared buffer. Therefore, it
> will take a long recovery time and sometimes fail for large tables
> depending on shared_buffer size.

Also very interesting. We only (?) have 8 GB of shared buffers, and I see from your message that you had 300 GB. All of our tables, both in prod and in my reproduction, were empty, but there were hundreds of thousands of them.

> The main problem here is the scanning of shared_buffers, which not only
> affects drop/truncate table, but also drop database and vacuum as well.

I wondered about that. I didn't have any problem with a single drop database, but the database dropped was a small one (albeit one with a few hundred thousand empty tables), and I neither tested dropping a large database nor dropping 100,000 databases. I didn't test vacuuming, but we do heavy vacuuming on all our primaries frequently, and...hmm. Regular vacuuming doesn't cause any problems that have made it onto my radar, but VACUUM FULLs can cause WAL files to pile up on the primary's pg_xlog before getting archived. I never investigated that, just throttled my VACUUM FULLs, because they're only ever run manually. I will keep an eye on the recovery time of individual files the next time I have to do this, which will probably be soon.

> But I think any working minor solutions/fixes from developers are also
> welcome, such as the recent committed patch for the multiple dropped
> tables per transaction with large shared_buffers.

Agreed. Should I have sent or should I still send this to pgsql-hackers? I wasn't sure, so I erred on the side of not bothering the developers until I'd gotten some feedback here.

Best,
Sherrylyn
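P.S. To put rough numbers on the shared_buffers scanning for our case: 8 GB of shared_buffers is about a million 8 kB buffers, and one full scan per dropped table adds up fast with a few hundred thousand drops. Back-of-the-envelope arithmetic only, not a measurement:

SELECT (8::bigint * 1024 * 1024 / 8)          AS buffers_in_8gb_shared_buffers,
       100000                                 AS tables_dropped,
       (8::bigint * 1024 * 1024 / 8) * 100000 AS approx_buffer_header_checks;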
pg_restore restores privileges differently from psql
Hi,

I'm running two Postgres 9.6.11 databases on RHEL 6.9. I'm restoring the schema from one database (prod) to another (dev). There are users that exist in prod that don't exist in dev. When the restore job tries to grant privileges to nonexistent users in dev, I would like it to generate an error, which is safe to ignore, but still correctly grant privileges to any user that does exist in dev. That's the behavior I see when I dump to a plain file and restore it using *psql -f*, but not the behavior I see when I do a *pg_dump -Fc* followed by *pg_restore*.

*pg_restore* seems to treat all the *GRANT* statements for a single object as a single statement, and when one errors out, they all error out, meaning I'm left with no privileges on the object in question. For instance, when this appears in my plaintext file:

GRANT ALL ON SCHEMA test TO user1;
GRANT USAGE ON SCHEMA test TO user2;
GRANT USAGE ON SCHEMA test TO user3;

and user1 doesn't exist on the target database, user2 and user3 get the expected privileges when restoring from *psql* but not from *pg_restore*.

Here's a reproducible test case.

CREATE DATABASE prod_db;
CREATE DATABASE dev_db_pg_restore;
CREATE DATABASE dev_db_psql;
CREATE ROLE prod_user;
CREATE ROLE prod_and_dev_user;

-- in prod_db
CREATE SCHEMA test;
GRANT ALL ON SCHEMA test TO prod_user;
GRANT USAGE ON SCHEMA test TO prod_and_dev_user;

pg_dump -Fc prod_db &> prod_dump.bin
pg_dump prod_db &> prod_dump.sql

-- On the database instance containing the dev dbs.
DROP ROLE prod_user;

pg_restore prod_dump.bin -d dev_db_pg_restore
psql -d dev_db_psql -f prod_dump.sql

-- In dev_db_psql
SELECT has_schema_privilege('prod_and_dev_user', 'test', 'usage');
Expected result: true
Actual result: true

-- In dev_db_pg_restore
SELECT has_schema_privilege('prod_and_dev_user', 'test', 'usage');
Expected result: true
Actual result: false

The behavior seems to be related to the fact that *pg_restore* reports the failed command as containing all the semicolon-delimited privilege commands, which get executed separately when restoring from plaintext:

pg_restore: [archiver (db)] could not execute query: ERROR: role "prod_user" does not exist
Command was: GRANT ALL ON SCHEMA test TO prod_user;
GRANT USAGE ON SCHEMA test TO prod_and_dev_user;

As a workaround, I've created an unprivileged dummy user by this name on the dev database, but my question is: is this a bug or a feature? If a feature, is the behavior documented? I didn't find any documentation, but that doesn't mean it doesn't exist.

Thanks,
Sherrylyn
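P.S. The difference also shows up directly in the catalog. Comparing the schema's ACL in the two dev databases after the restore, I'd expect prod_and_dev_user to appear in dev_db_psql's entry but not in dev_db_pg_restore's:

-- Run in each dev database
SELECT nspname, nspacl
FROM pg_namespace
WHERE nspname = 'test';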
Re: pg_restore restores privileges differently from psql
> The above needs more information:
> 1) Are the dev_* databases on a different cluster?
> 2) If so did you run:
> CREATE ROLE prod_user;
> CREATE ROLE prod_and_dev_user;
> on that cluster first?

I happened to put them all on the same cluster for my test case, in order to reproduce the unexpected behavior I encountered in the wild, where the prod and dev dbs happened to live on different clusters. In short, as long as you make sure the *prod_user* exists on the source cluster at the time when the dump is taken, and doesn't exist on the target cluster when the restore is carried out, you get the behavior I saw.

> Also if so: In the restores below are you sure you are pointed at the same
> cluster in each case?

Yes, I am sure. Both for the test case I was creating for the mailing list, and for the script where I first encountered this in the wild. Worked like a charm when I used *psql*, didn't do what I expected when I used *pg_restore*.

> What do you see if you do:
> pg_restore -f prod_dump_restore.sql prod_dump.bin
> and look in prod_dump_restore.sql?

This is exactly what I did when I was first trying to figure out what was going on. I see

GRANT USAGE ON SCHEMA test TO prod_and_dev_user;
GRANT ALL ON SCHEMA test TO prod_user;

If I then use *psql* to load *prod_dump_restore.sql* into a cluster that doesn't have the *prod_user* role, I get the expected behavior (*prod_and_dev_user* has usage on the schema *test*), because *psql* treats each of those statements as a separate command. *pg_restore* seems to treat them as a single command, judging by the error message and the behavior.

Best,
Sherrylyn
Re: pg_restore restores privileges differently from psql
> I don't see how that can work:
>
> test=# \c prod_db
> You are now connected to database "prod_db" as user "postgres".
> prod_db=# CREATE SCHEMA test;
> CREATE SCHEMA
> prod_db=# GRANT ALL ON SCHEMA test TO prod_user;
> GRANT
> prod_db=# GRANT USAGE ON SCHEMA test TO prod_and_dev_user;
> GRANT
> prod_db=# \c dev_db_psql
> You are now connected to database "dev_db_psql" as user "postgres".
> dev_db_psql=# DROP ROLE prod_user;
> ERROR: role "prod_user" cannot be dropped because some objects depend on it
> DETAIL: 1 object in database prod_db

Yes, if you're going to put all your databases on the same cluster, you first have to remove dependent objects before dropping the role. There are multiple ways of going about that: dropping the database, revoking the privileges on the objects in question, etc. If you put the databases on different clusters and make sure you create only the prod_and_dev_user on the second cluster, you won't run into this issue.

The goal is to make sure the prod_user role exists when the dump is taken and doesn't exist when the restore is done. You can do this by putting the databases on separate clusters and creating the appropriate roles, or by dropping the user on the single cluster.

Sorry, I considered spelling all this out in the original post, because there are two different ways of going about making sure the user isn't present for the restore, but it seemed unnecessarily complicated, and I thought I would let people decide what makes sense in their own environment for testing. Sorry if that led to more confusion in the end. Just make sure the user exists when you need it to exist and doesn't exist when you need it not to exist, test both the pg_restore and psql methods, and I expect you'll see the same behavior I did (and if not, I'll be very curious).

Best,
Sherrylyn
Re: pg_restore restores privileges differently from psql
> Yeah, this is a known issue --- the various GRANTs for a specific object
> are stored in a single "TOC entry" in the archive, which pg_restore will
> send to the server in a single PQexec call, causing them to be effectively
> one transaction. The easiest way to deal with it is to not send
> pg_restore's output directly to the target server, but feed it through
> psql, something like
>
> pg_restore ... | psql [connection parameters]
>
> There's been some discussion of a real fix, but it seems messy. pg_restore
> doesn't have a parser that would be adequate to separate out multiple SQL
> commands in a TOC entry, and we'd rather not try to give it one (mainly
> because of fear of cross-version compatibility issues).

Okay, thank you. I thought that might be the case: undesirable behavior where the costs of fixing outweigh the benefits. Given that, would it be worth making it more obvious in the pg_restore documentation that restoring directly with pg_restore and feeding its output through psql don't always give the same results in the target database?

Best,
Sherrylyn
Re: pg_upgrade --jobs
> are there any shortcuts to upgrading that would circumvent exporting the
> entire schema?

By "shortcuts," do you mean you want to minimize the time and energy you put into the upgrade, or that you want to minimize database downtime?

If you mean downtime, I was able to upgrade a customer-facing database with ~350,000 tables from Postgres 9.0 to 9.6 last year with only 86 seconds of downtime, using Slony, but I had to make many custom modifications to Slony and test thoroughly beforehand, and it was not for the faint of heart, the pressed for time, or the inexperienced. There may be better ways (and if so, I would be curious to learn about them), but Slony was the tool with which I was most familiar at the time.

This method does, of course, require exporting the entire schema, but because our only constraint was to minimize customer downtime, and the database was online while the schema was being exported, we didn't care how long it took. Your constraints may be different.

For those reading: we do know that 350,000 tables is Doing It Wrong, and we're getting rid of them, but we decided being on an EOLed version of Postgres was worse and should be fixed first.

Sherrylyn
Re: pg_upgrade --jobs
> It may take a while for slony replication to be in sync, but when it is,
> there will be very little down time to switch over.

I agree in principle, which is why I chose Slony over pg_upgrade for my company's very similar situation, but my experience was that, out of the box, Slony was projected to take unacceptably long (months) to sync our 350,000 tables, and downtime was going to be many hours. In order to get those numbers down, I had to reduce the number of per-table steps Slony was executing, e.g. by eliminating unnecessary-for-us ones and by rewriting others to happen in bulk.

Here's something I didn't know Slony did when I didn't have 350,000 tables to worry about: add a table to replication, run through the existing list of replicated tables to make sure it's captured any changes that have happened in the meantime, add one more table to replication, run through the existing list of tables to make sure no changes have happened, and so on. The more tables you add, the longer it takes to add the next table.

Here's another thing I didn't know it did: during the switchover, manage 4 triggers per table serially on the primary and the standby. 4 * 350,000 * 2 = 2.8 million triggers. (I knew it managed 4 triggers per table, but not that it did so serially, or how long that would take when pg_trigger had almost 3 million relevant rows.)

I would love to help the OP out in a more hands-on way (I have upgrade-via-Slony consulting experience), as well as to open source the custom changes I came up with, but I'm debating whether I have the time to take on another project right now. I'm also reluctant to summarize exactly what I did, because messing with pg_catalog directly is very delicate and likely to go wrong, and I don't recommend it to the inexperienced at all, or to the experienced if they have alternatives.

> Plan B is to drop a lot of tables and deal with imports later.

If it were me, I would take a copy of my database, restore it to a sandbox environment, set up Slony, and get an estimate of the projected sync time. Let it run for a few hours to see how dramatically the sync time slows down as the number of tables grows. I.e., don't count the number of replicated tables after an hour and assume that's your hourly rate of syncing.

If it looks like you can sync your entire database in a few days or less, then let it run and test out your failover time. If that's acceptable, you're good to go. If sync time looks like it's going to be incredibly long on your schema + hardware + Postgres version, etc., then failover time probably would be too. In that case, temporarily dropping the tables you can drop may be preferable to the complexity of making changes to speed up Slony, if you don't have a seriously experienced DBA on hand.

Sherrylyn
Does pg_stat_get_live_tuples() matter?
If a table has 1.8M rows in reality, yet pg_stat_get_live_tuples() returns 1.8K, does it matter to the performance of the database, as long as pg_class.reltuples is the right order of magnitude? The query planner seems to use the accurate estimate, but what about the autovacuum daemon? Or anything else?

In short, is pg_stat_get_live_tuples() (and the views that invoke it) used by any built-in tooling, or is it just there for the user to do monitoring?

Thanks,
Sherrylyn
Re: Does pg_stat_get_live_tuples() matter?
> What version of Postgres?

Sorry, can't believe I forgot to include that! Postgres 9.6.9.

> How are you getting at the 1.8M number?

SELECT COUNT(*)
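More concretely, the mismatch looks like this when the stats view and the catalog are queried side by side (the table name is a stand-in, not our real table):

SELECT s.relname,
       s.n_live_tup,                              -- what pg_stat_get_live_tuples() reports
       c.reltuples::bigint AS reltuples,          -- the planner's row estimate
       (SELECT count(*) FROM my_big_table) AS actual_rows
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.relname = 'my_big_table';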
Re: Does pg_stat_get_live_tuples() matter?
> Hmmm ... what was in reltuples again?

Reltuples had the correct number, and the query plans were showing the correct estimates.

> it's a bit hard to believe that it could get to be off by 1000X. Have you
> suppressed autovacuum on this table?

No, but here are some things I've observed:

1) Any time pg_stat_reset() gets run, pg_stat_user_tables.n_live_tup gets reset to 0. pg_class.reltuples is untouched.

2) If new tuples get inserted or deleted after pg_stat_reset(), pg_stat_user_tables.n_live_tup will match pg_stat_user_tables.n_tup_ins - pg_stat_user_tables.n_tup_del up until the next analyze or autoanalyze.

3) Once the next (auto)analyze happens, pg_stat_user_tables.n_live_tup will be updated to match pg_class.reltuples.

4) If a table is very large, it may be a while before autovacuum_analyze_scale_factor is reached. We have ours set to 0.01, which works out to 18K rows on a 1.8M-row table, an order of magnitude larger than 1.8K. (I would like to tune our settings more, but I have a list of priorities from management on which this isn't high.)

5) Given 1-4, pg_stat_user_tables.n_live_tup may spend quite a long time matching pg_stat_user_tables.n_tup_ins - pg_stat_user_tables.n_tup_del instead of pg_class.reltuples. For example, if a table has 1.8 million rows, and you insert 5 and delete 4 after a stats reset, n_live_tup will report that the table has 1 tuple.

6) Failovers, at least in Aurora, apparently cause pg_stat_reset() to be run, at least judging by the timestamp I'm seeing in pg_stat_bgwriter.stat_reset. We haven't done a failover in the data center in a while, and it's less trivial for me to test there atm, so I'm not certain whether open-source Postgres failovers also reset statistics.

> I don't see anything in the current core code that pays attention to
> n_live_tuples. reltuples definitely does matter to the planner, and some
> of the sibling counters like n_dead_tuples drive autovacuum, but nothing
> is examining n_live_tuples AFAICS.

That's what I thought, but I wanted to make sure I wasn't missing anything obvious. Thanks!

> some of the sibling counters like n_dead_tuples drive autovacuum

So that's interesting. I knew that, but hadn't thought about the implications. If pg_stat_reset() is executed by failovers, and a failover happens just before a table is ready to be vacuumed--say it has 0.009 * reltuples dead tuples (I'm simplifying the formula here)--then n_dead_tup gets reset to 0 and the counting starts all over again. Regular failovers could thus increase bloat by delaying the autovacuum daemon from recognizing that a table needs to be vacuumed, am I right?

Is it recommended practice to manually VACUUM ANALYZE the whole database after a failover? Or is resetting stats after a failover just an Aurora thing? I'm sorry I'm asking the latter question instead of testing, but I've been ordered not to spend time on improving our vacuuming and statistics until 5 other large projects are done, and I'm spending a minimal amount of time anyway, just to see how badly frequent failovers might be affecting us and whether there's any action we need to take.

Thanks,
Sherrylyn
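P.S. For anyone wanting to watch for this effect, a rough way to compare dead-tuple counts against the autovacuum trigger point (ignoring per-table reloptions) is something like:

SELECT s.relname,
       s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::int
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples
         AS approx_vacuum_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC
LIMIT 10;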