Heikki Linnakangas <[email protected]> writes:
> After my commit c532d15ddd to split up copy.c, buildfarm animal "prion"
> failed in pg_upgrade
> (https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-11-23%2009%3A23%3A16):
prion's continued to fail, rarely, in this same way:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-15%2004%3A08%3A06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-05-12%2018%3A43%3A14
The failures are remarkably identical, and they also look a lot like
field reports we've been seeing off and on for years. I do not know
why it always seems to be pg_toast_2619 (i.e. pg_statistic) that's
affected, but the pattern is pretty undeniable by now.
What I do have that's new is that *I can reproduce it*, at long last.
For me, installing the attached patch and running pg_upgrade's
"make check" fails, pretty much every time, with symptoms identical
to prion's.
The patch consists of
(1) 100ms delay just before detoasting, when loading a pg_statistic
catcache entry that has toasted datums
(2) provisions to mark such catcache entries dead immediately
(effectively, CATCACHE_FORCE_RELEASE for only these tuples);
this is to force us through (1) as often as possible
(3) some quick-hack debugging aids added to HeapTupleSatisfiesToast,
plus convert the missing-chunk error to PANIC to get a stack
trace.
If it doesn't reproduce for you, try adjusting the delay. 100ms
was the first value I tried, though, so I think it's probably
not too sensitive.
The trace I'm getting shows pretty positively that autovacuum
has fired on pg_statistic, and removed the needed toast entries,
just before the failure. So something is wrong with our logic
about when toast entries can be removed.
I do not have a lot of idea why, but I see something that is
probably a honkin' big clue:
2021-05-15 17:28:05.965 EDT [2469444] LOG: HeapTupleSatisfiesToast: xmin 2
t_infomask 0x0b02
That is, the toast tuples in question are not just frozen, but
actually have xmin = FrozenTransactionId.
I do not think that is normal --- at least, it's not the state
immediately after initdb, and I can't make it happen with
"vacuum freeze pg_statistic". A plausible theory is that pg_upgrade
caused this to happen (but how?) and then vacuuming of toast rows
goes off the rails because of it.
Anyway, I have no more time to poke at this right now, so I'm
posting the reproducer in case someone else wants to look at it.
regards, tom lane
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index d3c57cd16a..5830df83c2 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -73,11 +73,14 @@
#include "access/xlog.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
+#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/combocid.h"
#include "utils/snapmgr.h"
+bool trace_toast_visibility = false;
+
/*
* SetHintBits()
*
@@ -366,6 +369,13 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
Assert(ItemPointerIsValid(&htup->t_self));
Assert(htup->t_tableOid != InvalidOid);
+ if (trace_toast_visibility)
+ ereport(LOG,
+ errmsg("HeapTupleSatisfiesToast: xmin %u t_infomask 0x%04x",
+ HeapTupleHeaderGetXmin(tuple),
+ tuple->t_infomask),
+ debug_query_string ? 0 : errbacktrace());
+
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
@@ -420,6 +430,11 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
return false;
}
+ if (trace_toast_visibility)
+ elog(LOG, "HeapTupleSatisfiesToast: xmin %u t_infomask 0x%04x OK",
+ HeapTupleHeaderGetXmin(tuple),
+ tuple->t_infomask);
+
/* otherwise assume the tuple is valid for TOAST. */
return true;
}
diff --git a/src/backend/access/heap/heaptoast.c b/src/backend/access/heap/heaptoast.c
index 55bbe1d584..03ea2dc80d 100644
--- a/src/backend/access/heap/heaptoast.c
+++ b/src/backend/access/heap/heaptoast.c
@@ -781,7 +781,7 @@ heap_fetch_toast_slice(Relation toastrel, Oid valueid, int32 attrsize,
* Final checks that we successfully fetched the datum
*/
if (expectedchunk != (endchunk + 1))
- ereport(ERROR,
+ ereport(PANIC,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("missing chunk number %d for toast value %u in %s",
expectedchunk, valueid,
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 4fbdc62d8c..a1f923e02c 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -23,6 +23,7 @@
#include "access/xact.h"
#include "catalog/pg_collation.h"
#include "catalog/pg_operator.h"
+#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
#include "common/hashfn.h"
#include "miscadmin.h"
@@ -39,6 +40,7 @@
#include "utils/resowner_private.h"
#include "utils/syscache.h"
+extern bool trace_toast_visibility;
/* #define CACHEDEBUG */ /* turns DEBUG elogs on */
@@ -1810,6 +1812,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
{
CatCTup *ct;
HeapTuple dtp;
+ bool makedead = false;
MemoryContext oldcxt;
/* negative entries have no tuple associated */
@@ -1827,7 +1830,16 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
* something using a slightly stale catcache entry.
*/
if (HeapTupleHasExternal(ntp))
+ {
+ if (cache->cc_reloid == StatisticRelationId)
+ {
+ pg_usleep(100000);
+ makedead = true;
+ trace_toast_visibility = true;
+ }
dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);
+ trace_toast_visibility = false;
+ }
else
dtp = ntp;
@@ -1887,7 +1899,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
ct->my_cache = cache;
ct->c_list = NULL;
ct->refcount = 0; /* for the moment */
- ct->dead = false;
+ ct->dead = makedead;
ct->negative = negative;
ct->hash_value = hashValue;