Corrupt index stopping autovacuum system wide

2019-07-17 Thread Aaron Pelz
Hello,

Over the weekend we noticed that our max transaction IDs have been
continuously increasing - got an alert it passed 1B - and we believe that
no autovacuums were running for a period of about a month by looking at
pg_stat_user_tables. We had not updated any autovac tuning parameters over
that time period and many tables were very much over the threshold for
needing an autovac.

When investigating I located the table with the oldest transaction ID with:

SELECT c.oid::regclass as table_name,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age,
pg_size_pretty(pg_table_size(c.oid)) as table_size,
c.relkind,
c.relpersistence
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind in ('r', 't','m')
order by age desc limit 40;

I vacuumed that table manually with `vacuum freeze verbose table_xx` and
got this error:

INFO:  aggressively vacuuming "public.table_xx"
INFO:  scanned index "table_xx_pkey" to remove 168443 row versions
DETAIL:  CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.13 s
INFO:  scanned index "table_xx_col_id" to remove 168443 row versions
DETAIL:  CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
ERROR:  failed to re-find parent key in index "table_xx_col_idx" for
deletion target page 217

I replaced the index (create a new concurrently, delete the old
concurrently), vacuumed the table, and immediately autovacs started across
the system and our XIDs started falling. To me it looks like a *single*
corrupt index held up autovacuums across our entire server, even other in
other databases on the same server. Am I interpreting this correctly? Would
love guidance on diagnosing this type of thing and strategies for
preventing it.

Thanks,
Aaron


Re: Corrupt index stopping autovacuum system wide

2019-07-17 Thread Aaron Pelz
> What PostgreSQL version are you on? Was this an INCLUDE index on
PostgreSQL 11?

On 11, and no it was just a normal btree.

> It's possible that amcheck would have given you an accurate diagnosis
> of the problem -- especially if you used bt_index_parent_check():

I'll look into this, seems helpful. Thanks!

On Wed, Jul 17, 2019 at 12:21 PM Peter Geoghegan  wrote:

> On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz  wrote:
> > To me it looks like a *single* corrupt index held up autovacuums across
> our entire server, even other in other databases on the same server. Am I
> interpreting this correctly?
>
> Yes -- that is correct.
>
> What PostgreSQL version are you on? Was this an INCLUDE index on
> PostgreSQL 11?
>
> > Would love guidance on diagnosing this type of thing and strategies for
> preventing it.
>
> It's possible that amcheck would have given you an accurate diagnosis
> of the problem -- especially if you used bt_index_parent_check():
>
> https://www.postgresql.org/docs/current/amcheck.html
> --
> Peter Geoghegan
>


Re: Corrupt index stopping autovacuum system wide

2019-07-18 Thread Aaron Pelz
> Can you tell us more about this index? Can you share its definition
> (i.e. what does \d show in psql)?

> Is it an expression index, or a partial index? A composite? What
> datatypes are indexed?

It's a simple btree expression on a geometry(Point,4326) , no expression no
partial no composite.

On Wed, Jul 17, 2019 at 3:58 PM Peter Geoghegan  wrote:

> On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan  wrote:
> > > It's possible that amcheck would have given you an accurate diagnosis
> > > of the problem -- especially if you used bt_index_parent_check():
> > >
> > > https://www.postgresql.org/docs/current/amcheck.html
> >
> > BTW, be sure to use the 'heapallindexed' option with
> > bt_index_parent_check() to detect missing downlinks, which is exactly
> > the problem that VACUUM complained about.
>
> Can you tell us more about this index? Can you share its definition
> (i.e. what does \d show in psql)?
>
> Is it an expression index, or a partial index? A composite? What
> datatypes are indexed?
>
> Thanks
> --
> Peter Geoghegan
>