Error on vacuum: xmin before relfrozenxid

2018-05-22 Thread Paolo Crosato
ile
 syslog_facility  | local0
 | configuration file
 syslog_ident | postgres
 | configuration file
 TimeZone | Europe/Rome
  | configuration file
 track_activity_query_size| 2048
 | configuration file
 track_functions  | all
  | configuration file
 track_io_timing  | on
 | configuration file
 wal_keep_segments| 200
  | configuration file
 wal_level| logical
  | configuration file
(51 rows)



Operating system and version:

CentOS Linux release 7.4.1708 (Core)
Linux xx.xx.com 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

What program you're using to connect to PostgreSQL:

psql, jdbc driver. It's not an application side error.

Is there anything relevant or unusual in the PostgreSQL server logs?:

Yes, we see these kind of errors:

May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
ERROR:  found xmin 2889675859 from before relfrozenxid 400011439
May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_authid"

When these errors pop up, autovacuum repeatedly fails on this very same
table, and we have to resort to manual vacuums

For questions about any kind of error:
What you were doing when the error happened / how to cause the error:

The first time we saw the error, we found an idle transaction was left open
for several days. The user was not monitored for long running transactions.
We killed the session, deleted the offending row (the one with the xmin
value reported) and then vacuumed the table.
However, teh same error pops up every time we try to add a new user. We
would like to fix these error once and for all, so we can add new users.

The EXACT TEXT of the error message you're getting, if there is one: (Copy
and paste the message to the email, do not send a screenshot)

This is the error:

psql (10.4)
Type "help" for help.

postgres=# vacuum pg_authid;
ERROR:  found xmin 3031994631 from before relfrozenxid 400011439

Some datas:

postgres=# select xmin from pg_authid ;
xmin

  1
  1
  1
  1
  1
557
7216348
  110077819
  110511334
 3031994631
 3032044199
 3032044199
 3032044199
 3032070282
(14 rows)

postgres=# select relfrozenxid from pg_class where relname='pg_authid';
 relfrozenxid
--
400011439
(1 row)

postgres=#

Is this a sympthom of data corruption or transaction wraparound due to the
long running transaction that we killed weeks ago? This is the only table
in the whole cluster that has this error. We are monitoring transactions
wraparound with the the check_postgres.pl script, the check is still
running fine and no alert was given at all since the cluster has been
running.

Thank you in advance for any answer.

Paolo Crosato


Re: Error on vacuum: xmin before relfrozenxid

2018-05-22 Thread Paolo Crosato
Hello,


2018-05-22 18:49 GMT+02:00 Andres Freund :

> Hi,
>
> On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> > PostgreSQL version number you are running:
> >
> > PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> > 20150623 (Red Hat 4.8.5-28), 64-bit
> >
> > How you installed PostgreSQL:
> >
> > From the pgdg yum repositories.
> >
> > Changes made to the settings in the postgresql.conf file:  see Server
> > Configuration for a quick way to list them all.
> > May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> > ERROR:  found xmin 2889675859 from before relfrozenxid 400011439
> > May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> > CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_authid"
> >
> > postgres=# select xmin from pg_authid ;
> > xmin
> > 
> >   1
> >   1
> >   1
> >   1
> >   1
> > 557
> > 7216348
> >   110077819
> >   110511334
> >  3031994631
> >  3032044199
> >  3032044199
> >  3032044199
> >  3032070282
> > (14 rows)
> >
> > postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> >  relfrozenxid
> > --
> > 400011439
> > (1 row)
> >
> > postgres=#
> >
> > Is this a sympthom of data corruption or transaction wraparound due to
> the
> > long running transaction that we killed weeks ago?
>
> That shouldn't be the sole cause.
>
>
> > This is the only table in the whole cluster that has this error. We
> > are monitoring transactions wraparound with the the check_postgres.pl
> > script, the check is still running fine and no alert was given at all
> > since the cluster has been running.
> >
> > Thank you in advance for any answer.
>
> Could you report the result of
> select ctid, xmin, xmax from pg_authid ;
>

This is the result:

postgres=# select ctid, xmin, xmax from pg_authid ;
  ctid  |xmin| xmax
++--
 (0,1)  |  1 |0
 (0,2)  |  1 |0
 (0,3)  |  1 |0
 (0,4)  |  1 |0
 (0,5)  |  1 |0
 (0,6)  |557 |0
 (0,11) |7216348 |0
 (0,12) |  110077819 |0
 (0,13) |  110511334 |0
 (0,16) | 3031994631 |0
 (0,17) | 3032044199 |0
 (0,18) | 3032044199 |0
 (0,19) | 3032044199 |0
 (0,20) | 3032070282 |0
(14 rows)

postgres=#


> and
> CREATE EXTENSION pageinspect;
> SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
>

This is the output of the second query:

postgres=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
 lp | lp_off | lp_flags | lp_len |   t_xmin   | t_xmax | t_field3 | t_ctid
| t_infomask2 | t_infomask | t_hoff |  t_bits  |  t_oid  |
   t_data
++--++++--++-+++--+-+--
  1 |   8080 |1 |108 |  1 |  0 |0 | (0,1)
|  11 |   2313 | 32 | 1000 |3373 |
\x70675f6d6f6e69746f720001
  2 |   7968 |1 |108 |  1 |  0 |0 | (0,2)
|  11 |   2313 | 32 | 1000 |3374 |
\x70675f726561645f616c6c5f73657474696e67730001
  3 |   7856 |1 |108 |  1 |  0 |0 | (0,3)
|  11 |   2313 | 32 | 1000 |3375 |
\x70675f726561645f616c6c5f737461747301
  4 |   7744 |1 |108 |  1 |  0 |0 | (0,4)
|  11 |   2313 | 32 | 1000 |3377 |
\x70675f737461745f7363616e5f7461626c657301
  5 |   7632 |1 |108 | 

Re: Error on vacuum: xmin before relfrozenxid

2018-05-22 Thread Paolo Crosato
Good evening,

2018-05-22 23:19 GMT+02:00 Andres Freund :

> Hi,
>
> On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > > Could you report the result of
> > > select ctid, xmin, xmax from pg_authid ;
> > >
> >
> > This is the result:
> >
> > postgres=# select ctid, xmin, xmax from pg_authid ;
>
> >  (0,16) | 3031994631 |0
> >  16 |   6496 |1 |144 | 3031994631 |  0 |0 |
> (0,16)
> > |   32779 |  10507 | 32 | 1100 |  675851 |
> > \x6e6167696f73000
> 
> 10100496d64356333633236616163636
> 439616665346437383061396239613464663634653639
>
> > postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> >  relfrozenxid
> > --
> > 400011439
>
> That tuple looks, to me, like it indeed shouldn't exist, given the
> relfrozenxid.  Decoding infomask (10507 / 0x290B), if I did it right,
> shows:
>
> HEAP_HASNULL
> HEAP_HASVARWIDTH
> HEAP_HASOID
> HEAP_XMIN_COMMITTED
> HEAP_XMAX_INVALID
> HEAP_UPDATED
>
> so it's not frozen.  That suggests there's either a bug, or you had
> corruption in your cluster.
>
>
> Could you give a bit of "history" about that postgres instance? What
> version of postgres was it run on earliest? Was it ever pg_upgraded?
> Were there any OS crashes? Other hardware issues?  Was the cluster ever
> used with fsync=off or full_page_writes=off?
>
> Greetings,
>
> Andres Freund
>

The cluster is made of a primary master instance, and a secondary slave in
hot standby, with streaming replication. There is a barman server for the
backups.
The first version it ran on was 10.2, the cluster was promoted in
production in mid february. It was never pg_upgraded, we upgraded it
yesterday to 10.4 and restarted, in hope that this would resolve the issue.
The restart was performed by the upgrading process. There was never any
crash or hardware issue, the instance run without any problem. we never
enabled fsync=off or full_page_writes=off. This is the only real issue so
far. We have many nagios alerts monitoring the instance, the only problem
is we are creating many temporary files, most of them are caused by a query
that doesn't overflow work_mem when run alone with explain analyze buffers,
which is weird. I should mention that we had a major creation of temp files
in the first hours of the cluster history, however that was quickly
resolved.

I managed to recover the log of the first time we run into the issue, the
error was the same but on template1:

May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
found xmin 2600758304 from before relfrozenxid 400011439
May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
automatic vacuum of table "template1.pg_catalog.pg_authid"

Deleting the row worked, but, as you see, the problem with  400011439 was
there already. This happened when we added a user, shortly afterwards these
errors started popping up in the log. We found the issue because the
autovacuums stopped working, they were blocked on the pg_authid table. The
same time we found out the long running transaction, it was a multi session
kill that someway got stuck. Killing the transaction and deleting the
offending row was enough for the autovacuums to restart. We also had a
materialized view that was created months before and nobody used. Since i
found in the past there was a similar error related to materialized views,
I dropped it.

The instance is doing a lot of transactions every day, probably around 100
millions. I would need to do more checks to provide an accurate measure.

I still have the logs since the first time the error appeared, so I can
provide further details if needed.

Thanks,
Best Regards

Paolo Crosato


Re: Error on vacuum: xmin before relfrozenxid

2018-05-23 Thread Paolo Crosato
2018-05-23 1:39 GMT+02:00 Andres Freund :

> Hi,
>
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> >
> > May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
>
> pg_authid (along with a few other tables) is shared between
> databases. So that's just hte same error.  At which rate are you
> creating / updating database users / roles?
>

Once or twice a month. Yesterday I updated some rows on the table with a

update pg_auth_id set rolname=rolname where rolname=...

for some users in the hope to work around the issue, but it didn't work.
That's why there are recent xmin on some rows.

Best Regards,

Paolo Crosato


Re: Error on vacuum: xmin before relfrozenxid

2018-05-23 Thread Paolo Crosato
2018-05-23 20:32 GMT+02:00 Andres Freund :

> On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > > I managed to recover the log of the first time we run into the issue,
> the
> > > error was the same but on template1:
> > >
> > > May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > > found xmin 2600758304 from before relfrozenxid 400011439
> > > May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client=
> CONTEXT:
> > > automatic vacuum of table "template1.pg_catalog.pg_authid"
> >
> > pg_authid (along with a few other tables) is shared between
> > databases. So that's just hte same error.  At which rate are you
> > creating / updating database users / roles?
>
> Other questions:
> - did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
>   tables)?
> - Did you have any failovers?
> - Do you use logical replication?
>

1) VACUUM FULL was issued after the first time the error occurred, and a
couple of times later. CLUSTER was never run.
2) Several failovers tests were perfomed before the cluster was moved to
production. However, before the move, the whole cluster was wiped,
including all the application and monitoring users. After the db was moved
to production, a couple of users were added without any problem.
3) No, even if the replication level is set to logical in postgresql.conf,
we only use streaming replication.

Best Regards,

Paolo Crosato


Re: Error on vacuum: xmin before relfrozenxid

2018-05-24 Thread Paolo Crosato
2018-05-24 8:30 GMT+02:00 Andrey Borodin :

> Hi!
>
> > 24 мая 2018 г., в 0:55, Paolo Crosato 
> написал(а):
> >
> > 1) VACUUM FULL was issued after the first time the error occurred, and a
> couple of times later. CLUSTER was never run.
> > 2) Several failovers tests were perfomed before the cluster was moved to
> production. However, before the move, the whole cluster was wiped,
> including all the application and monitoring users. After the db was moved
> to production, a couple of users were added without any problem.
> > 3) No, even if the replication level is set to logical in
> postgresql.conf, we only use streaming replication.
>
> I've encountered seemingly similar ERROR:
> [ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR:  found xmin
> 747375134 from before relfrozenxid 2467346321
> [ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT:  automatic vacuum
> of table "postgres.pg_catalog.pg_database"
>
> Table pg_database, probably, was not changed anyhow for a long period of
> database exploitation.
> Unfortunately, I've found out this only there were million of xids left
> and had to vacuum freeze db in single user mode asap. But, probably, I will
> be able to restore database from backups and inspect it, if necessary.
> Though first occurrence of this error was beyond recovery window.
>
> Best regards, Andrey Borodin.


I could build a mirror instance with barman and see if the issue is present
as well, then try to vacuum freeze it in single mode, and see if it
disappears; but I would like to know why it happened in the first time. I
wonder if the autovacuum settings played a role, we kept the defaults, even
if the instance has a very heavy update workload.

Best Regards,

Paolo Crosato