Fwd: PG12 autovac issues
Apologies, I accidentally sent this to the pgsql-admin list initially but intended it to go here:

We have a database that isn't overly large (~20G), but gets incredibly frequent updates. Here's an example table:

feedi=# select * from pg_stat_all_tables where schemaname = 'production' and relname = 'tita';
relid = 16786
schemaname = production
relname = tita
seq_scan = 23365
seq_tup_read = 403862091
idx_scan = 26612759248
idx_tup_fetch = 19415752701
n_tup_ins = 24608806
n_tup_upd = 4207076934
n_tup_del = 24566916
n_tup_hot_upd = 4073821079
n_live_tup = 79942
n_dead_tup = 71969
n_mod_since_analyze = 12020
last_vacuum = 2020-03-17 15:35:19.588859+00
last_autovacuum = 2020-03-17 21:31:08.248598+00
last_analyze = 2020-03-17 15:35:20.372875+00
last_autoanalyze = 2020-03-17 22:04:41.76743+00
vacuum_count = 9
autovacuum_count = 135693
analyze_count = 9
autoanalyze_count = 495877

As you can see in this table, there are only ~80K rows, but billions of updates. What we have observed is that the frozenxid reaches the 200M mark fairly quickly because of the amount of activity. What is interesting is that this happens with the 'postgres' and 'template1' databases as well, even though there is absolutely no activity in those databases.

When the 'postgres' and/or 'template1' databases hit the freeze_max_age, there are cases where it kicks off an aggressive autovac of those tables which seems to prevent autovacs from running elsewhere. Oddly, this is not consistent, but that condition seems to be required. We have observed this across multiple PG12 servers (dev, test, staging, production), all with similar workloads.

$ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c
     17 Mar 17 06
     34 Mar 17 07
     31 Mar 17 08
     31 Mar 17 09
     30 Mar 17 10
     34 Mar 17 11
     33 Mar 17 12
     19 Mar 17 13
     40 Mar 17 15
     31 Mar 17 16
     36 Mar 17 17
     34 Mar 17 18
     35 Mar 17 19
     35 Mar 17 20
     33 Mar 17 21

As you can see above, we end up having around 33 autovacs/hr, and at about 13:30 today they stopped, until we ran a "vacuum freeze verbose analyze;" against the 'postgres' database (around 15:30), which then caused the autovacs to resume running against the "feedi" database.

I'm completely perplexed as to what is happening and why it suddenly started when we moved from PG10 > PG12. The configs and workload are essentially the same between versions. We realize we could simply increase autovacuum_freeze_max_age, but that doesn't seem to actually resolve anything -- it just pushes the problem out. Has anyone seen anything similar to this?

Thanks very much for the consideration.

Justin King
http://flightaware.com/
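P.S. In case the context helps, this is roughly how we watch the per-database wraparound pressure -- a minimal sketch against the standard catalogs, not our exact monitoring query:

SELECT datname,
       age(datfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

Once xid_age crosses freeze_max_age for a database, that's when the anti-wraparound autovacs described above kick in.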
Re: Fwd: PG12 autovac issues
On Tue, Mar 17, 2020 at 5:39 PM Adrian Klaver wrote: > > On 3/17/20 3:22 PM, Justin King wrote: > > Apologies, I accidentally sent this to the pgsql-admin list initially > > but intended it go here: > > > > We have a database that isn't overly large (~20G), but gets incredibly > > frequent updates. Here's an example table: > > > > feedi=# select * from pg_stat_all_tables where schemaname = > > 'production' and relname = 'tita'; > > relid = 16786 > > schemaname = production > > relname = tita > > seq_scan = 23365 > > seq_tup_read = 403862091 > > idx_scan = 26612759248 > > idx_tup_fetch = 19415752701 > > n_tup_ins = 24608806 > > n_tup_upd = 4207076934 > > n_tup_del = 24566916 > > n_tup_hot_upd = 4073821079 > > n_live_tup = 79942 > > n_dead_tup = 71969 > > n_mod_since_analyze = 12020 > > last_vacuum = 2020-03-17 15:35:19.588859+00 > > last_autovacuum = 2020-03-17 21:31:08.248598+00 > > last_analyze = 2020-03-17 15:35:20.372875+00 > > last_autoanalyze = 2020-03-17 22:04:41.76743+00 > > vacuum_count = 9 > > autovacuum_count = 135693 > > analyze_count = 9 > > autoanalyze_count = 495877 > > > > As you can see in this table, there are only ~80K rows, but billions > > of updates. What we have observed is that the frozenxid reaches the > > 200M mark fairly quickly because of the amount of activity. What is > > interesting is that this happens with the 'postgres' and 'template1' > > databases as well and there is absolutely no activity in those > > databases. > > > > When the 'postgres' and/or 'template1' databases hit the > > freeze_max_age, there are cases where it kicks off an aggressive > > autovac of those tables which seems to prevent autovacs from running > > elsewhere. Oddly, this is not consistent, but that condition seems to > > be required. We have observed this across multiple PG12 servers (dev, > > test, staging, production) all with similar workloads. > > Is there anything in postgres and template1 besides what was created at > init? There is nothing in there at all besides system tables created at init. > > What are your settings for autovacuum?: > > https://www.postgresql.org/docs/12/runtime-config-autovacuum.html Here are the settings, these are the only ones that are not set to default with the exception of a few tables that have been overridden with a different value due to lots of updates and few rows: autovacuum = on log_autovacuum_min_duration = 0 autovacuum_max_workers = 8 autovacuum_naptime = 15s autovacuum_vacuum_threshold = 500 autovacuum_analyze_threshold = 2500 vacuum_cost_limit = 1000 We want fairly aggressive autovacs to keep table bloat limited -- the application latency suffers if it has to wade through dead tuples and staying near realtime is important in our environment. ** Also, it should be noted that the autovacuum_analyze_threshold is probably an incorrect value, we likely intended that to be 250 and just have now realized it after poking more at the configuration. 
> > > > > $ grep -i vacuum /var/log/postgresql/postgres.log | cut -b 1-9 | uniq -c > > 17 Mar 17 06 > > 34 Mar 17 07 > > 31 Mar 17 08 > > 31 Mar 17 09 > > 30 Mar 17 10 > > 34 Mar 17 11 > > 33 Mar 17 12 > > 19 Mar 17 13 > > 40 Mar 17 15 > > 31 Mar 17 16 > > 36 Mar 17 17 > > 34 Mar 17 18 > > 35 Mar 17 19 > > 35 Mar 17 20 > > 33 Mar 17 21 > > > > As you can see above, we end up having around ~33 autovac/hr, and > > about 13:30 today, they stopped until we ran a "vacuum freeze verbose > > analyze;" against the 'postgres' database (around 15:30) which then > > caused the autovacs to resume running against the "feedi" database. > > > > I'm completely perplexed as to what is happening and why it suddenly > > started when we moved from PG10 > PG12. The configs and workload are > > essentially the same between versions. We realize we could simply > > increase the autovacuum_freeze_max_age, but that doesn't seem to > > actually resolve anything -- it just pushes the problem out. Has > > anyone seen anything similar to this? > > > > Thanks very much for the consideration. > > > > Justin King > > http://flightaware.com/ > > > > > > > -- > Adrian Klaver > adrian.kla...@aklaver.com
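P.S. For completeness, the handful of per-table overrides mentioned above can be listed with something like this sketch (it only relies on pg_class.reloptions):

SELECT c.oid::regclass AS rel, c.reloptions
FROM pg_class c
WHERE c.reloptions IS NOT NULL
ORDER BY c.relname;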
RE: PG12 autovac issues
Hi Andres- Thanks for the reply, answers below. On Tue, Mar 17, 2020 at 8:19 PM Andres Freund wrote: > > Hi, > > On 2020-03-17 17:18:57 -0500, Justin King wrote: > > As you can see in this table, there are only ~80K rows, but billions > > of updates. What we have observed is that the frozenxid reaches the > > 200M mark fairly quickly because of the amount of activity. > > And each of those updates is in a separate transaction? Is that > required? I.e. any chance to perform multiple of those updates in one > transaction? > > Have you considered just increasing the vacuum limit? It's mostly there > because it can increase space usage a bit, but given today's systems its > not a usually a problem unless you have hundreds of postgres instances > on a single system. It's pretty common to just increase that to 1.8 > billion or so. We have considered increasing the limit, but as I mentioned, the problem that we're seeing is that (effectively) a autovac starts on a system database (postgres, template1) and never completes, or deadlocks, or something. This completely stops autovacs from running until we manually intervene and run a VACUUM FREEZE -- at which point, the autovacs resume. If we increase the vacuum limit and this situation happens, we're going to be in real trouble. > From a single stats snapshot we can't actually understand the actual xid > consumption - is it actually the xid usage that triggers the vacuums? We have looked at this and the xid consumption averages around 1250 xid/sec -- this is when we see the "aggressive" autovac kick off in the logs. What I don't understand is why these xid's are being consumed at this rate on the databases with no activity (postgres, template1). > > > What makes you think it is a problem that you have all these vacuums? If > you actually update that much, and you have indexes, you're going want a > lot of vacuums? > I actually don't think there's a problem with the vacuums (I was mostly pointing out that they are very regular and not problematic). The main problem I am having is that something is causing the autovacuums to completely stop and require manual intervention to resume -- and it seems to be when the "postgres" or "template1" database hits the autovacuum_freeze_max_age. > > > What is interesting is that this happens with the 'postgres' and > > 'template1' databases as well and there is absolutely no activity in > > those databases. > > That's normal. They should be pretty darn quick in v12? Yes, a manual VACUUM FREEZE of either database takes less than 1 second -- which is why it's perplexing that the autovac starts and never seems to complete and prevents other autovacs from running. > > Greetings, > > Andres Freund
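P.S. For reference on the ~1250 xid/sec figure: a simple way to approximate the consumption rate is to sample the counter twice -- a sketch using plain psql variables (the 60-second interval is arbitrary; txid_current() itself burns an xid, which is negligible here):

SELECT txid_current() AS before_xid \gset
SELECT pg_sleep(60);
SELECT (txid_current() - :before_xid) / 60.0 AS xid_per_sec;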
Re: Re: PG12 autovac issues
On Wed, Mar 18, 2020 at 10:13 AM Adrian Klaver wrote: > > On 3/18/20 6:57 AM, Justin King wrote: > Please reply to list also > Ccing list > > > >>> Here are the settings, these are the only ones that are not set to > >>> default with the exception of a few tables that have been overridden > >>> with a different value due to lots of updates and few rows: > >> > >> And those values are? > > > > Thanks for the response, hopefully this will help: > > The below is helpful, but what I was referring to above was the settings > for the overridden tables. Ah, apologies, I missed that. They are: alter table production.tita set (autovacuum_analyze_scale_factor = 0, autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 10, autovacuum_analyze_threshold = 5); > > > > > postgres=# select name,setting from pg_settings where name like '%vacuum%'; > > name = setting > > autovacuum = on > > autovacuum_analyze_scale_factor = 0.1 > > autovacuum_analyze_threshold = 2500 > > autovacuum_freeze_max_age = 2 > > autovacuum_max_workers = 8 > > autovacuum_multixact_freeze_max_age = 4 > > autovacuum_naptime = 15 > > autovacuum_vacuum_cost_delay = 20 > > autovacuum_vacuum_cost_limit = -1 > > autovacuum_vacuum_scale_factor = 0.2 > > autovacuum_vacuum_threshold = 500 > > autovacuum_work_mem = -1 > > log_autovacuum_min_duration = 0 > > vacuum_cleanup_index_scale_factor = 0.1 > > vacuum_cost_delay = 0 > > vacuum_cost_limit = 1000 > > vacuum_cost_page_dirty = 20 > > vacuum_cost_page_hit = 1 > > vacuum_cost_page_miss = 10 > > vacuum_defer_cleanup_age = 0 > > vacuum_freeze_min_age = 5000 > > vacuum_freeze_table_age = 15000 > > vacuum_multixact_freeze_min_age = 500 > > vacuum_multixact_freeze_table_age = 15000 > > > >> > >> More below. > >> > >>> > >>> autovacuum = on > >>> log_autovacuum_min_duration = 0 > >>> autovacuum_max_workers = 8 > >>> autovacuum_naptime = 15s > >>> autovacuum_vacuum_threshold = 500 > >>> autovacuum_analyze_threshold = 2500 > >>> vacuum_cost_limit = 1000 > >> > >> Are either of the below set > 0?: > >> > >> vacuum_cost_delay > >> > >> autovacuum_vacuum_cost_delay > >> > >>> > >>> We want fairly aggressive autovacs to keep table bloat limited -- the > >>> application latency suffers if it has to wade through dead tuples and > >>> staying near realtime is important in our environment. > >>> > >>> ** Also, it should be noted that the autovacuum_analyze_threshold is > >>> probably an incorrect value, we likely intended that to be 250 and > >>> just have now realized it after poking more at the configuration. > >>> > >>>> > >> > >> > >> > >> -- > >> Adrian Klaver > >> adrian.kla...@aklaver.com > > > -- > Adrian Klaver > adrian.kla...@aklaver.com
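P.S. A quick way to sanity-check what production.tita actually ended up with after those ALTERs is to unpack its stored reloptions -- a sketch using pg_options_to_table():

SELECT option_name, option_value
FROM pg_options_to_table(
       (SELECT reloptions FROM pg_class WHERE oid = 'production.tita'::regclass));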
Re: Fwd: PG12 autovac issues
On Wed, Mar 18, 2020 at 1:40 PM Michael Lewis wrote: > > Do you have default fillfactor set on this table? If not, I would wonder if > reducing it to 50% or even 20% would allow many more HOT updates that would > reduce bloat. I don't believe we have a default fillfactor, but I'm still trying to understand why autovacs would completely stop -- that seems like a bug. Especially since there was no change between PG10 and PG12 and this problem never existed there. > Also, is there any period of lower activity on your system that you could > schedule a vacuum freeze for daily or weekly? I believe having frozen pages > would also mean all the autovacuums would be able to skip more pages and > therefore be faster. > > >> autovacuum_vacuum_cost_delay = 20 > > This was changed to 2ms in PG12. You should reduce that most likely. Actually, we set that back from 2ms > 20ms to replicate what PG10 was doing just in case it had an impact (spoiler: it did not).
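For what it's worth, going by the pg_stat_all_tables numbers in my first mail, roughly 97% of updates on that table are already HOT (4073821079 of 4207076934), which is part of why I'm skeptical that fillfactor is the issue. If we do experiment with it, it would look something like the sketch below -- the 50 is just the suggested starting point, and existing pages only pick up a new fillfactor after a rewrite (VACUUM FULL / CLUSTER):

-- HOT update ratio per table
SELECT schemaname, relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_all_tables
WHERE schemaname = 'production';

-- leave free space on each page for HOT updates
ALTER TABLE production.tita SET (fillfactor = 50);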
Re: Fwd: PG12 autovac issues
On Thu, Mar 19, 2020 at 11:02 AM Michael Lewis wrote: > > On Thu, Mar 19, 2020 at 9:31 AM Justin King wrote: >> >> On Wed, Mar 18, 2020 at 1:40 PM Michael Lewis wrote: >> > >> > Do you have default fillfactor set on this table? If not, I would wonder >> > if reducing it to 50% or even 20% would allow many more HOT updates that >> > would reduce bloat. >> >> I don't believe we have a default fillfactor, but I'm still trying to >> understand why autovacs would completely stop -- that seems like a >> bug. Especially since there was no change between PG10 and PG12 and >> this problem never existed there. > > > Is there any reason to not schedule vacuum freeze for each db daily? Just > curious.

No, not really -- it just feels like a bandaid for something that should (and was) working already. It's possibly where I'm headed, but I just thought I'd reach out to the community to see if anyone has seen this or could think of a reason why it might have started by moving from PG10 > PG12, and whether it is a possible regression.
Re: PG12 autovac issues
On Thu, Mar 19, 2020 at 5:35 PM Andres Freund wrote: > > Hi, > > On 2020-03-19 10:23:48 -0500, Justin King wrote: > > > From a single stats snapshot we can't actually understand the actual xid > > > consumption - is it actually the xid usage that triggers the vacuums? > > > > We have looked at this and the xid consumption averages around 1250 > > xid/sec -- this is when we see the "aggressive" autovac kick off in > > the logs. What I don't understand is why these xid's are being > > consumed at this rate on the databases with no activity (postgres, > > template1). > > The xid counter is global across all databases.

Then what does the "age" value represent for each database in this case? Perhaps I'm misunderstanding what I'm looking at?

postgres=# SELECT datname, age(datfrozenxid), current_setting('autovacuum_freeze_max_age') FROM pg_database;
 datname   |    age    | current_setting
-----------+-----------+-----------------
 postgres  | 100937449 | 200000000
 template1 |  50244438 | 200000000
 template0 | 160207297 | 200000000
 feedi     | 150147602 | 200000000

> > > > What makes you think it is a problem that you have all these vacuums? If > > > you actually update that much, and you have indexes, you're going want a > > > lot of vacuums? > > > I actually don't think there's a problem with the vacuums (I was > > mostly pointing out that they are very regular and not problematic). > > The main problem I am having is that something is causing the > > autovacuums to completely stop and require manual intervention to > > resume -- and it seems to be when the "postgres" or "template1" > > database hits the autovacuum_freeze_max_age. > > Did you look at pg_stat_activity for those autovacuums to see whether > they're blocked on something?

This is not something we've done yet but will next time it occurs.

> > > > What is interesting is that this happens with the 'postgres' and > > > > 'template1' databases as well and there is absolutely no activity in > > > > those databases. > > > > > > That's normal. They should be pretty darn quick in v12? > > > > Yes, a manual VACUUM FREEZE of either database takes less than 1 > > second -- which is why it's perplexing that the autovac starts and > > never seems to complete and prevents other autovacs from running. > > One big difference between a manual VACUUM and autovacuum is that with > the default settings VACUUM is not throttled, but autovacuum is. > > What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay, > vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit, > vacuum_cost_page_miss set to?

Here are all the vacuum related values for the server:

postgres=# select name,setting from pg_settings where name like '%vacuum%';
autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 2500
autovacuum_freeze_max_age = 200000000
autovacuum_max_workers = 8
autovacuum_multixact_freeze_max_age = 400000000
autovacuum_naptime = 15
autovacuum_vacuum_cost_delay = 20
autovacuum_vacuum_cost_limit = -1
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
autovacuum_work_mem = -1
log_autovacuum_min_duration = 0
vacuum_cleanup_index_scale_factor = 0.1
vacuum_cost_delay = 0
vacuum_cost_limit = 1000
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0
vacuum_freeze_min_age = 50000000
vacuum_freeze_table_age = 150000000
vacuum_multixact_freeze_min_age = 5000000
vacuum_multixact_freeze_table_age = 150000000

I know the database is busy, so the throttling makes sense, but it seems like it would complete eventually. We see blocked autovacs for many hours.
> Greetings, > > Andres Freund
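P.S. Per the pg_stat_activity suggestion, this is roughly what I plan to capture the next time the autovacs wedge (a sketch; backend_type and the wait_event columns are available in v10+):

SELECT pid, backend_start, xact_start, state,
       wait_event_type, wait_event, backend_xmin, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
ORDER BY xact_start;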
Re: PG12 autovac issues
On Thu, Mar 19, 2020 at 6:56 PM Andres Freund wrote: > > Hi, > > On 2020-03-19 18:07:14 -0500, Justin King wrote: > > On Thu, Mar 19, 2020 at 5:35 PM Andres Freund wrote: > > > > > > Hi, > > > > > > On 2020-03-19 10:23:48 -0500, Justin King wrote: > > > > > From a single stats snapshot we can't actually understand the actual > > > > > xid > > > > > consumption - is it actually the xid usage that triggers the vacuums? > > > > > > > > We have looked at this and the xid consumption averages around 1250 > > > > xid/sec -- this is when we see the "aggressive" autovac kick off in > > > > the logs. What I don't understand is why these xid's are being > > > > consumed at this rate on the databases with no activity (postgres, > > > > template1). > > > > > > The xid counter is global across all databases. > > > > Then what does the "age" value represent for each database in this > > case? Perhaps I'm misunderstanding what I'm looking at? > > > > postgres=# SELECT datname, age(datfrozenxid), > > current_setting('autovacuum_freeze_max_age') FROM pg_database; > > datname |age| current_setting > > ---+---+- > > postgres | 100937449 | 2 > > template1 | 50244438 | 2 > > template0 | 160207297 | 2 > > feedi | 150147602 | 2 > > Look at datfrozenxid without the age(). age(xid) computes how "old" xid > is compared to the "next" xid to be assigned. Until vacuum comes around > and performs work, pg_database.datfrozenxid / pg_class.relfrozenxid are > constant, since they represent the values actually present in the > table. But if xids are being consumed, their "age" increases, because > they're further and further in the past relative to the "newest" xids. > > > > > One big difference between a manual VACUUM and autovacuum is that with > > > the default settings VACUUM is not throttled, but autovacuum is. > > > > > > What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay, > > > vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit, > > > vacuum_cost_page_miss set to? > > > > Here are all the vacuum related values for the server: > > > > postgres=# select name,setting from pg_settings where name like '%vacuum%'; > > autovacuum = on > > autovacuum_analyze_scale_factor = 0.1 > > autovacuum_analyze_threshold = 2500 > > autovacuum_freeze_max_age = 2 > > autovacuum_max_workers = 8 > > autovacuum_multixact_freeze_max_age = 4 > > autovacuum_naptime = 15 > > autovacuum_vacuum_cost_delay = 20 > > autovacuum_vacuum_cost_limit = -1 > > autovacuum_vacuum_scale_factor = 0.2 > > autovacuum_vacuum_threshold = 500 > > autovacuum_work_mem = -1 > > log_autovacuum_min_duration = 0 > > vacuum_cleanup_index_scale_factor = 0.1 > > vacuum_cost_delay = 0 > > vacuum_cost_limit = 1000 > > vacuum_cost_page_dirty = 20 > > vacuum_cost_page_hit = 1 > > vacuum_cost_page_miss = 10 > > vacuum_defer_cleanup_age = 0 > > vacuum_freeze_min_age = 5000 > > vacuum_freeze_table_age = 15000 > > vacuum_multixact_freeze_min_age = 500 > > vacuum_multixact_freeze_table_age = 15000 > > > > I know the database is busy, so the throttling makes sense, but it > > seems like it would complete eventually. > > The cost limit/delay are way too long/small respectively for a busy > postgres instance. This does make sense, we will look into adjusting those values. > > We see blocked autovacs for many hours. > > On the same table, or just generally being busy? We haven't isolated *which* table it is blocked on (assuming it is), but all autovac's cease running until we manually intervene. 
When we get into this state again, is there some other information (other than what is in pg_stat_statement or pg_stat_activity) that would be useful for folks here to help understand what is going on? > > Greetings, > > Andres Freund
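P.S. On the throttling point, the change we plan to test is along these lines -- a sketch; the 2ms matches the PG12 default we had reverted, and the 2000 cost limit is our own guess rather than a recommendation from this thread:

ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();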
Re: PG12 autovac issues
On Mon, Mar 23, 2020 at 3:00 PM Andres Freund wrote: > > Hi, > > On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote: > > > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest > > > database > > > SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class > > > WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1; > > > > The vm has been trashed since, and I don't have that level of detail > > available > > in the gathered stats unfortunately (and the available information I have > > is a > > little bit messy, sorry for that). > > > > > - Oldest backend xmin > > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE > > > backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3; > > > - oldest replication xmin: > > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication > > > WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3; > > > - oldest slot xmin: > > > SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM > > > pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY > > > greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3; > > > - oldest prepared transaction (unfortunately xmin not available) > > > SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY > > > age(transaction) LIMIT 3; > > > > I have output from this query: > > https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732 > > (which may be dumb). Here are the non-null rows after 8AM GMT+1: > > Could you share what the config of the server was? > > > > > > The same bench was run against pg11 many times and never triggered this > > > > issue. > > > > So far our best guess is a side effect of 2aa6e331ead7. > > > > > > In that case you'd likely see DEBUG1 output, right? Did you try running > > > with that? > > > > That's unfortunately not an option, as while the issue is reproducible, it > > happens after roughly 16h with an average of 6.5ktps, and we definitely > > don't > > have enough disk space for that amount of logs. We'll try next time with a > > recompiled version with the DEBUG1 promoted to WARNING, and maybe other > > elog. > > You wouldn't need to run it with DEBUG1 the whole time, you could just > change the config and reload config once you hit the issue. > > > > > > Michael and I have been trying to reproduce this issue locally > > > > (drastically > > > > reducing the various freeze_age parameters) for hours, but no luck for > > > > now. > > > > > > Have you considered using gdb to investigate? > > > > Not for now, because it's hard to be around for the ~ 90min interval the > > system > > stays alive after atuvacuum get stuck, especially since it's a weekend > > bench. > > > > > > This is using a vanilla pg 12.1, with some OLTP workload. The only > > > > possibly > > > > relevant configuration changes are quite aggressive autovacuum settings > > > > on some > > > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum > > > > scale > > > > factor to 0, for both heap and toast). > > > > > > That, uh, is not an insignificant set of changes for an autovac > > > scheduling issues. > > > > > > It sounds like what might be happening is that you have something > > > holding back the "oldest needed transaction" horizon. Before > > > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated > > > vacuums unable to increase the horizon, but afterwards they'll all > > > immediately exit without any messages. 
> > > > The xmin horizon is increasing AFAICS. Grepping the autovacuum logs, I can > > see > > for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from > > 200069684 to 210363848. E.g.: > > Which database is this on? > > > > [...] > > 266603:2020-03-14 08:49:59.927 UTC LOG: automatic aggressive vacuum to > > prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0 > > 266604- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped > > frozen > > 266605- tuples: 0 removed, 0 remain, 0 are dead but not yet > > removable, oldest xmin: 209635921 > > 266606- buffer usage: 25 hits, 1 misses, 1 dirtied > > 266607- avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s > > 266608- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: > > 0.00 s > > 266609:2020-03-14 08:49:59.929 UTC LOG: automatic aggressive vacuum to > > prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0 > > 266610- pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped > > frozen > > 266611- tuples: 0 removed, 0 remain, 0 are dead but not yet > > removable, oldest xmin: 209635934 > > 266612- buffer usage: 25 hits, 1 misses, 1 dirtied > > 266613- avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s > > 266614- system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: > > 0.00 s > > 266615:2020-03-14 08:49:59.931 UTC LOG: auto
Re: PG12 autovac issues
On Mon, Mar 23, 2020 at 4:31 PM Justin King wrote: > > On Mon, Mar 23, 2020 at 3:00 PM Andres Freund wrote: > > > > Hi, > > > > On 2020-03-23 20:47:25 +0100, Julien Rouhaud wrote: > > > > - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest > > > > database > > > > SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class > > > > WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1; > > > > > > The vm has been trashed since, and I don't have that level of detail > > > available > > > in the gathered stats unfortunately (and the available information I have > > > is a > > > little bit messy, sorry for that). > > > > > > > - Oldest backend xmin > > > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity > > > > WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3; > > > > - oldest replication xmin: > > > > SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication > > > > WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3; > > > > - oldest slot xmin: > > > > SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) > > > > FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY > > > > greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3; > > > > - oldest prepared transaction (unfortunately xmin not available) > > > > SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY > > > > age(transaction) LIMIT 3; > > > > > > I have output from this query: > > > https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732 > > > (which may be dumb). Here are the non-null rows after 8AM GMT+1: > > > > Could you share what the config of the server was? > > > > > > > > > The same bench was run against pg11 many times and never triggered > > > > > this issue. > > > > > So far our best guess is a side effect of 2aa6e331ead7. > > > > > > > > In that case you'd likely see DEBUG1 output, right? Did you try running > > > > with that? > > > > > > That's unfortunately not an option, as while the issue is reproducible, it > > > happens after roughly 16h with an average of 6.5ktps, and we definitely > > > don't > > > have enough disk space for that amount of logs. We'll try next time with > > > a > > > recompiled version with the DEBUG1 promoted to WARNING, and maybe other > > > elog. > > > > You wouldn't need to run it with DEBUG1 the whole time, you could just > > change the config and reload config once you hit the issue. > > > > > > > > > Michael and I have been trying to reproduce this issue locally > > > > > (drastically > > > > > reducing the various freeze_age parameters) for hours, but no luck > > > > > for now. > > > > > > > > Have you considered using gdb to investigate? > > > > > > Not for now, because it's hard to be around for the ~ 90min interval the > > > system > > > stays alive after atuvacuum get stuck, especially since it's a weekend > > > bench. > > > > > > > > This is using a vanilla pg 12.1, with some OLTP workload. The only > > > > > possibly > > > > > relevant configuration changes are quite aggressive autovacuum > > > > > settings on some > > > > > tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum > > > > > scale > > > > > factor to 0, for both heap and toast). > > > > > > > > That, uh, is not an insignificant set of changes for an autovac > > > > scheduling issues. > > > > > > > > It sounds like what might be happening is that you have something > > > > holding back the "oldest needed transaction" horizon. 
Before > > > > 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated > > > > vacuums unable to increase the horizon, but afterwards they'll all > > > > immediately exit without any messages. > > > > > > The xmin horizon is increasing AFAICS. Grepping the autovacuum logs, I > > > can see > > > for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from > > >
Re: PG12 autovac issues
All- This started happening again. DEBUG1 is enabled: Mar 25 14:48:03 cowtn postgres[39720]: [35294-1] 2020-03-25 14:48:03.972 GMT [39720] DEBUG: autovacuum: processing database "template0" Mar 25 14:48:06 cowtn postgres[39735]: [35294-1] 2020-03-25 14:48:06.545 GMT [39735] DEBUG: autovacuum: processing database "postgres" Mar 25 14:48:11 cowtn postgres[39759]: [35294-1] 2020-03-25 14:48:11.284 GMT [39759] DEBUG: autovacuum: processing database "template1" Mar 25 14:48:14 cowtn postgres[39772]: [35294-1] 2020-03-25 14:48:14.564 GMT [39772] DEBUG: autovacuum: processing database "feedi" Mar 25 14:48:14 cowtn postgres[39772]: [35295-1] 2020-03-25 14:48:14.588 GMT [39772] LOG: automatic vacuum of table "feedi.pg_catalog.pg_statistic": index scans: 1 Mar 25 14:48:14 cowtn postgres[39772]: [35295-2] #011pages: 0 removed, 117 remain, 0 skipped due to pins, 0 skipped frozen Mar 25 14:48:14 cowtn postgres[39772]: [35295-3] #011tuples: 477 removed, 704 remain, 0 are dead but not yet removable, oldest xmin: 189591147 Mar 25 14:48:14 cowtn postgres[39772]: [35295-4] #011buffer usage: 315 hits, 0 misses, 0 dirtied Mar 25 14:48:14 cowtn postgres[39772]: [35295-5] #011avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s Mar 25 14:48:14 cowtn postgres[39772]: [35295-6] #011system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s Mar 25 14:48:15 cowtn postgres[39772]: [35296-1] 2020-03-25 14:48:15.606 GMT [39772] LOG: automatic vacuum of table "feedi.production.flightplans": index scans: 1 Mar 25 14:48:15 cowtn postgres[39772]: [35296-2] #011pages: 0 removed, 61331 remain, 0 skipped due to pins, 29772 skipped frozen Mar 25 14:48:15 cowtn postgres[39772]: [35296-3] #011tuples: 21807 removed, 509983 remain, 1581 are dead but not yet removable, oldest xmin: 189591147 Mar 25 14:48:15 cowtn postgres[39772]: [35296-4] #011buffer usage: 113684 hits, 0 misses, 1 dirtied Mar 25 14:48:15 cowtn postgres[39772]: [35296-5] #011avg read rate: 0.000 MB/s, avg write rate: 0.008 MB/s Mar 25 14:48:15 cowtn postgres[39772]: [35296-6] #011system usage: CPU: user: 0.64 s, system: 0.12 s, elapsed: 1.00 s Mar 25 14:48:16 cowtn postgres[39772]: [35297-1] 2020-03-25 14:48:16.537 GMT [39772] LOG: automatic analyze of table "feedi.production.flightplans" system usage: CPU: user: 0.83 s, system: 0.03 s, elapsed: 0.93 s Mar 25 14:48:16 cowtn postgres[39772]: [35298-1] 2020-03-25 14:48:16.627 GMT [39772] LOG: automatic vacuum of table "feedi.production.tita": index scans: 1 Mar 25 14:48:16 cowtn postgres[39772]: [35298-2] #011pages: 0 removed, 1711 remain, 0 skipped due to pins, 0 skipped frozen Mar 25 14:48:16 cowtn postgres[39772]: [35298-3] #011tuples: 5936 removed, 75280 remain, 1079 are dead but not yet removable, oldest xmin: 189591147 Mar 25 14:48:16 cowtn postgres[39772]: [35298-4] #011buffer usage: 5748 hits, 0 misses, 1 dirtied Mar 25 14:48:16 cowtn postgres[39772]: [35298-5] #011avg read rate: 0.000 MB/s, avg write rate: 0.100 MB/s Mar 25 14:48:16 cowtn postgres[39772]: [35298-6] #011system usage: CPU: user: 0.06 s, system: 0.00 s, elapsed: 0.07 s Mar 25 14:48:16 cowtn postgres[39772]: [35299-1] 2020-03-25 14:48:16.959 GMT [39772] LOG: automatic analyze of table "feedi.production.tita" system usage: CPU: user: 0.32 s, system: 0.00 s, elapsed: 0.33 s Mar 25 14:48:16 cowtn postgres[39772]: [35300-1] 2020-03-25 14:48:16.969 GMT [39772] LOG: automatic analyze of table "feedi.production.virtual_clocks" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s Mar 25 14:48:18 cowtn postgres[39790]: [35294-1] 
2020-03-25 14:48:18.975 GMT [39790] DEBUG: autovacuum: processing database "template0" Mar 25 14:48:21 cowtn postgres[39799]: [35294-1] 2020-03-25 14:48:21.546 GMT [39799] DEBUG: autovacuum: processing database "postgres" Mar 25 14:48:22 cowtn postgres[39853]: [35294-1] 2020-03-25 14:48:22.446 GMT [39853] LOG: connection received: host=10.4.4.11 port=25424 Mar 25 14:48:22 cowtn postgres[39853]: [35295-1] 2020-03-25 14:48:22.451 GMT [39853] LOG: connection authorized: user=feedi database=feedi SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off) Mar 25 14:48:26 cowtn postgres[39875]: [35294-1] 2020-03-25 14:48:26.277 GMT [39875] DEBUG: autovacuum: processing database "postgres" Mar 25 14:48:26 cowtn postgres[39875]: [35295-1] 2020-03-25 14:48:26.298 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_authid" Mar 25 14:48:26 cowtn postgres[39875]: [35296-1] 2020-03-25 14:48:26.309 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_subscription" Mar 25 14:48:26 cowtn postgres[39875]: [35297-1] 2020-03-25 14:48:26.319 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postgres.pg_catalog.pg_database" Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25 14:48:26.329 GMT [39875] DEBUG: skipping redundant vacuum to prevent wraparound of table "postg
Re: PG12 autovac issues
On Wed, Mar 25, 2020 at 8:43 PM Michael Paquier wrote: > > On Wed, Mar 25, 2020 at 10:39:17AM -0500, Justin King wrote: > > This started happening again. DEBUG1 is enabled: > > Thanks for enabling DEBUG1 logs while this happened. > > > Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25 > > 14:48:26.329 GMT [39875] DEBUG: skipping redundant vacuum to prevent > > wraparound of table "postgres.pg_catalog.pg_tablespace" > > Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25 > > 14:48:26.339 GMT [39875] DEBUG: skipping redundant vacuum to prevent > > wraparound of table "postgres.pg_catalog.pg_auth_members" > > Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25 > > 14:48:26.350 GMT [39875] DEBUG: skipping redundant vacuum to prevent > > wraparound of table "postgres.pg_catalog.pg_replication_origin" > > Are you seeing such log entries happening multiple times for the same > relations, meaning that autovacuum workers are just looping on the > same relations all over again? This part of the logs point to catalog > tables, but are there other tables within your system facing the same > logs, particularly the database "feedi" with some of your own tables? Nope, it was just these tables that were looping over and over while nothing else was getting autovac'd. I'm happy to share the full log if you'd like. > > > postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM > > pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT > > 1; > > oid|age| relfrozenxid > > ---+---+-- > > pg_authid | 202793549 | 4284570172 > > Ugh. I think that this is exactly the thing I was suspecting > previously: > - The database stats look sane. > - The relation stats don't look good and visibly are put in such a > state that only one type of jobs gets triggered (non-aggressive but > anti-wraparound), which just keep being skipped and the relation stats > don't get refreshed. (Such autovacuum jobs should never happen and we > have some underlying issues that will need separate care). > > If you still have the cluster in this current state (perhaps you are > not able to keep it longer), could you provide more data about > pg_class.relfrozenxid for the tables which are mentioned in the logs > of the type "skipping redundant vacuum to prevent of table"? I did have to remove it from this state, but I can undo my workaround and, undoubtedly, it'll end up back there. Let me know if there's something specific you'd like me to provide when it happens! > > > Let me know if there's anything else useful I can provide. > > Thanks! > -- > Michael
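P.S. When it wedges again, this is the query I'll run for the catalogs named in the "skipping redundant vacuum" lines -- a sketch covering just the tables that showed up in my log excerpt:

SELECT oid::regclass, relfrozenxid, age(relfrozenxid), relminmxid
FROM pg_class
WHERE relname IN ('pg_authid', 'pg_subscription', 'pg_database',
                  'pg_tablespace', 'pg_auth_members', 'pg_replication_origin');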
Re: PG12 autovac issues
On Fri, Mar 27, 2020 at 12:12 AM Michael Paquier wrote: > > On Thu, Mar 26, 2020 at 09:46:47AM -0500, Justin King wrote: > > Nope, it was just these tables that were looping over and over while > > nothing else was getting autovac'd. I'm happy to share the full log > > if you'd like. > > Thanks, that could help. If that's very large, it could be a problem > to send that to the lists, but you could send me directly a link to > it and I'll try to extract more information for the lists. While > testing for reproducing the issue, I have noticed that basically one > set of catalog tables happened to see this "skipping redundant" log. > And I am wondering if we have a match with the set of catalog tables > looping. Sounds great. I will email you directly with a link! > > I did have to remove it from this state, but I can undo my workaround > > and, undoubtedly, it'll end up back there. Let me know if there's > > something specific you'd like me to provide when it happens! > > For now I think it's fine. Note that Julien and I have an environment > where the issue can be reproduced easily (it takes roughly 12 hours > until the wraparound cutoffs are reached with the benchmark and > settings used), and we are checking things using a patched instance > with 2aa6e33 reverted. I think that we are accumulating enough > evidence that this change was not a good idea anyway thanks to the > information you sent, so likely we'll finish first by a revert of > 2aa6e33 from the master and REL_12_STABLE branches, before looking at > the issues with the catalogs for those anti-wraparound and > non-aggressive jobs (this looks like a relcache issue with the so-said > catalogs). This is encouraging. As I mentioned, we have a workaround in place for the moment, but don't hesitate if you need anything else from me. Thanks for jumping in on the thread, it was nice validation to know that I wasn't the only one seeing the issue! > -- > Michael
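P.S. For anyone who lands on this thread before a fix is released: our interim workaround is essentially what unstuck things earlier -- a scheduled manual freeze of the affected databases, roughly:

-- run in each small/system database on a schedule; cadence is site-specific
VACUUM (FREEZE, ANALYZE);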
walreceiver termination
We've seen unexpected termination of the WAL receiver process. This stops streaming replication, but the replica stays available -- restarting the server resumes streaming replication where it left off. We've seen this across nearly every recent version of PG, (9.4, 9.5, 11.x, 12.x) -- anything omitted is one we haven't used. I don't have an explanation for the cause, but I was able to set logging to "debug5" and run an strace of the walrecevier PID when it eventually happened. It appears as if the SIGTERM is coming from the "postgres: startup" process. The only notable thing that I've seen discussion of (in the past) and where it might be relevant here is that these replicas all reside on a ZFS filesystem (on Linux). In fact, everywhere we've seen issues, the database sits on a ZFS pool. We also have another PG cluster running on non-ZFS (ext4) and have never experienced this problem. Any suggestions? Logging snippets are below: postgresql.log Apr 23 09:09:13 royce postgres[2229]: [43141904-1] 2020-04-23 09:09:13.663 GMT [2229] DEBUG: record known xact 8677704 latestObservedXid 8677704 Apr 23 09:09:13 royce postgres[2229]: [43141904-2] 2020-04-23 09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A6E0 for Btree/INSERT_LEAF: off 88 Apr 23 09:09:13 royce postgres[2229]: [43141905-1] 2020-04-23 09:09:13.663 GMT [2229] DEBUG: record known xact 8677704 latestObservedXid 8677704 Apr 23 09:09:13 royce postgres[2229]: [43141905-2] 2020-04-23 09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A720 for Heap/HEAP_CONFIRM: off 6 Apr 23 09:09:13 royce postgres[2229]: [43141906-1] 2020-04-23 09:09:13.663 GMT [2229] DEBUG: record known xact 8677704 latestObservedXid 8677704 Apr 23 09:09:13 royce postgres[2229]: [43141906-2] 2020-04-23 09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A750 for Heap/INSERT: off 7 flags 0x0C Apr 23 09:09:13 royce postgres[2229]: [43141907-1] 2020-04-23 09:09:13.663 GMT [2229] LOG: incorrect resource manager data checksum in record at 2D6/C259AB90 Apr 23 09:09:13 royce postgres[2299]: [1448555-1] 2020-04-23 09:09:13.663 GMT [2299] FATAL: terminating walreceiver process due to administrator command Apr 23 09:09:13 royce postgres[2299]: [1448556-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make Apr 23 09:09:13 royce postgres[2299]: [1448557-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 6 on_shmem_exit callbacks to make Apr 23 09:09:13 royce postgres[2229]: [43141908-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: switched WAL source from stream to archive after failure Apr 23 09:09:13 royce postgres[2229]: [43141909-1] 2020-04-23 09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum in record at 2D6/C259AB90 Apr 23 09:09:13 royce postgres[2299]: [1448558-1] 2020-04-23 09:09:13.664 GMT [2299] DEBUG: proc_exit(1): 2 callbacks to make Apr 23 09:09:13 royce postgres[2229]: [43141910-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: switched WAL source from archive to stream after failure Apr 23 09:09:13 royce postgres[2229]: [43141911-1] 2020-04-23 09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum in record at 2D6/C259AB90 Apr 23 09:09:13 royce postgres[2299]: [1448559-1] 2020-04-23 09:09:13.664 GMT [2299] DEBUG: exit(1) Apr 23 09:09:13 royce postgres[2229]: [43141912-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: switched WAL source from stream to archive after failure Apr 23 09:09:13 royce postgres[2229]: [43141913-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: incorrect resource 
manager data checksum in record at 2D6/C259AB90 Apr 23 09:09:13 royce postgres[2229]: [43141914-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: switched WAL source from archive to stream after failure Apr 23 09:09:13 royce postgres[2299]: [1448560-1] 2020-04-23 09:09:13.664 GMT [2299] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make Apr 23 09:09:13 royce postgres[2229]: [43141915-1] 2020-04-23 09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum in record at 2D process id list: 2224 ?Ss 0:24 /usr/lib/postgresql/12/bin/postgres -D /var/lib/postgresql/12/main -c unix_socket_directories=/var/run/postgresql -c config_file=/etc/postgresql/12/main/postgresql.conf -c hba_file=/etc/postgresql/12/main/pg_hba.conf -c ident_file=/etc/postgresql/12/main/pg_ident.conf -c external_pid_file=/var/run/postgresql/12-main.pid 2229 ?Ss 7:02 postgres: startup 2230 ?Ss 6:09 postgres: checkpointer 2231 ?Ss 0:08 postgres: background writer 2298 ?Ss 1:29 postgres: stats collector 2299 ?Ss 6:11 postgres: walreceiver strace of walreceiver: kill(2229, SIGUSR1) = 0 getpid()= 2299 sendto(3, "<135>Apr 23 09:09:13 postgres[2299]: [1448554-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: sending write 2D6/C25A flush 2D6/C25A apply 2D6/C25915F0", 156, MSG_NOSIGNAL, NULL, 0)
Re: walreceiver termination
On Thu, Apr 23, 2020 at 12:47 PM Tom Lane wrote: > > Justin King writes: > > We've seen unexpected termination of the WAL receiver process. This > > stops streaming replication, but the replica stays available -- > > restarting the server resumes streaming replication where it left off. > > We've seen this across nearly every recent version of PG, (9.4, 9.5, > > 11.x, 12.x) -- anything omitted is one we haven't used. > > > I don't have an explanation for the cause, but I was able to set > > logging to "debug5" and run an strace of the walrecevier PID when it > > eventually happened. It appears as if the SIGTERM is coming from the > > "postgres: startup" process. > > The startup process intentionally SIGTERMs the walreceiver under > various circumstances, so I'm not sure that there's any surprise > here. Have you checked the postmaster log? > > regards, tom lane Yep, I included "debug5" output of the postmaster log in the initial post.
Re: walreceiver termination
I assume it would be related to the following: LOG: incorrect resource manager data checksum in record at 2D6/C259AB90 since the walreceiver terminates just after this - but I'm unclear what precisely this means. Without digging into the code, I would guess that it's unable to verify the checksum on the segment it just received from the master; however, there are multiple replicas here, so it points to an issue on this client. However, it happens everywhere -- we have ~16 replicas across 3 different clusters (on different versions) and we see this uniformly across them all at seemingly random times. Also, just to clarify, this will only happen on a single replica at a time. On Thu, Apr 23, 2020 at 2:46 PM Justin King wrote: > > On Thu, Apr 23, 2020 at 12:47 PM Tom Lane wrote: > > > > Justin King writes: > > > We've seen unexpected termination of the WAL receiver process. This > > > stops streaming replication, but the replica stays available -- > > > restarting the server resumes streaming replication where it left off. > > > We've seen this across nearly every recent version of PG, (9.4, 9.5, > > > 11.x, 12.x) -- anything omitted is one we haven't used. > > > > > I don't have an explanation for the cause, but I was able to set > > > logging to "debug5" and run an strace of the walrecevier PID when it > > > eventually happened. It appears as if the SIGTERM is coming from the > > > "postgres: startup" process. > > > > The startup process intentionally SIGTERMs the walreceiver under > > various circumstances, so I'm not sure that there's any surprise > > here. Have you checked the postmaster log? > > > > regards, tom lane > > Yep, I included "debug5" output of the postmaster log in the initial post.
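P.S. One thing we can still capture while a walreceiver is up is its view of the stream -- a sketch; pg_stat_wal_receiver exists in 9.6+, so it won't cover the 9.4/9.5 replicas:

SELECT pid, status, received_lsn, latest_end_lsn,
       last_msg_send_time, last_msg_receipt_time
FROM pg_stat_wal_receiver;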
Re: walreceiver termination
On Thu, Apr 23, 2020 at 3:06 PM Tom Lane wrote: > > Justin King writes: > > I assume it would be related to the following: > > LOG: incorrect resource manager data checksum in record at 2D6/C259AB90 > > since the walreceiver terminates just after this - but I'm unclear > > what precisely this means. > > What it indicates is corrupt data in the WAL stream. When reading WAL > after crash recovery, we assume that that indicates end of WAL. When > pulling live data from a source server, it suggests some actual problem > ... but killing the walreceiver and trying to re-fetch the data might > be a reasonable response to that. I'm not sure offhand what the startup > code thinks it's doing in this context. It might either be attempting > to retry, or concluding that it's come to the end of WAL and it ought > to promote to being a live server. If you don't see the walreceiver > auto-restarting then I'd suspect that the latter is happening. > > regards, tom lane

The walreceiver is definitely not restarting -- replication stops cold right at that segment. I'm a little unclear where to go from here -- is there additional info that would be useful?
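P.S. Next time it happens I'll also grab the replica's recovery position before restarting, to confirm whether replay really stops at that record -- a sketch using the v10+ function names:

SELECT pg_is_in_recovery(),
       pg_last_wal_receive_lsn(),
       pg_last_wal_replay_lsn(),
       pg_last_xact_replay_timestamp();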
Re: walreceiver termination
Would there be anyone that might be able to help troubleshoot this issue -- or at least give me something that would be helpful to look for?

https://www.postgresql.org/message-id/flat/CAGH8ccdWLLGC7qag5pDUFbh96LbyzN_toORh2eY32-2P1%3Dtifg%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6

I'm not the first one to report something similar, and all of the complaints have one thing in common: an out-of-the-ordinary filesystem -- ZFS in particular (or btrfs, in the bottom case). Is there anything more we can do here to help narrow down this issue? I'm happy to help, but I honestly wouldn't even know where to begin.

Thanks-

Justin King
flightaware.com

On Thu, Apr 23, 2020 at 4:40 PM Justin King wrote: > > On Thu, Apr 23, 2020 at 3:06 PM Tom Lane wrote: > > > > Justin King writes: > > > I assume it would be related to the following: > > > LOG: incorrect resource manager data checksum in record at 2D6/C259AB90 > > > since the walreceiver terminates just after this - but I'm unclear > > > what precisely this means. > > > > What it indicates is corrupt data in the WAL stream. When reading WAL > > after crash recovery, we assume that that indicates end of WAL. When > > pulling live data from a source server, it suggests some actual problem > > ... but killing the walreceiver and trying to re-fetch the data might > > be a reasonable response to that. I'm not sure offhand what the startup > > code thinks it's doing in this context. It might either be attempting > > to retry, or concluding that it's come to the end of WAL and it ought > > to promote to being a live server. If you don't see the walreceiver > > auto-restarting then I'd suspect that the latter is happening. > > > > regards, tom lane > > The walreceiver is definitely not restarting -- replication stops cold > right at that segment. I'm a little unclear where to go from here -- > is there additional info that would be useful?