Corrupt index stopping autovacuum system wide
Hello, Over the weekend we noticed that our max transaction IDs have been continuously increasing - got an alert it passed 1B - and we believe that no autovacuums were running for a period of about a month by looking at pg_stat_user_tables. We had not updated any autovac tuning parameters over that time period and many tables were very much over the threshold for needing an autovac. When investigating I located the table with the oldest transaction ID with: SELECT c.oid::regclass as table_name, greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age, pg_size_pretty(pg_table_size(c.oid)) as table_size, c.relkind, c.relpersistence FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind in ('r', 't','m') order by age desc limit 40; I vacuumed that table manually with `vacuum freeze verbose table_xx` and got this error: INFO: aggressively vacuuming "public.table_xx" INFO: scanned index "table_xx_pkey" to remove 168443 row versions DETAIL: CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.13 s INFO: scanned index "table_xx_col_id" to remove 168443 row versions DETAIL: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s ERROR: failed to re-find parent key in index "table_xx_col_idx" for deletion target page 217 I replaced the index (create a new concurrently, delete the old concurrently), vacuumed the table, and immediately autovacs started across the system and our XIDs started falling. To me it looks like a *single* corrupt index held up autovacuums across our entire server, even other in other databases on the same server. Am I interpreting this correctly? Would love guidance on diagnosing this type of thing and strategies for preventing it. Thanks, Aaron
Re: Corrupt index stopping autovacuum system wide
> What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11? On 11, and no it was just a normal btree. > It's possible that amcheck would have given you an accurate diagnosis > of the problem -- especially if you used bt_index_parent_check(): I'll look into this, seems helpful. Thanks! On Wed, Jul 17, 2019 at 12:21 PM Peter Geoghegan wrote: > On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz wrote: > > To me it looks like a *single* corrupt index held up autovacuums across > our entire server, even other in other databases on the same server. Am I > interpreting this correctly? > > Yes -- that is correct. > > What PostgreSQL version are you on? Was this an INCLUDE index on > PostgreSQL 11? > > > Would love guidance on diagnosing this type of thing and strategies for > preventing it. > > It's possible that amcheck would have given you an accurate diagnosis > of the problem -- especially if you used bt_index_parent_check(): > > https://www.postgresql.org/docs/current/amcheck.html > -- > Peter Geoghegan >
Re: Corrupt index stopping autovacuum system wide
> Can you tell us more about this index? Can you share its definition > (i.e. what does \d show in psql)? > Is it an expression index, or a partial index? A composite? What > datatypes are indexed? It's a simple btree expression on a geometry(Point,4326) , no expression no partial no composite. On Wed, Jul 17, 2019 at 3:58 PM Peter Geoghegan wrote: > On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan wrote: > > > It's possible that amcheck would have given you an accurate diagnosis > > > of the problem -- especially if you used bt_index_parent_check(): > > > > > > https://www.postgresql.org/docs/current/amcheck.html > > > > BTW, be sure to use the 'heapallindexed' option with > > bt_index_parent_check() to detect missing downlinks, which is exactly > > the problem that VACUUM complained about. > > Can you tell us more about this index? Can you share its definition > (i.e. what does \d show in psql)? > > Is it an expression index, or a partial index? A composite? What > datatypes are indexed? > > Thanks > -- > Peter Geoghegan >