Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2018-04-05 Thread Adam Sjøgren
Tom writes:

> a...@novozymes.com (Adam =?utf-8?Q?Sj=C3=B8gren?=) writes:
>> Also, the error we are getting is now: "unexpected chunk number 2
>> (expected 3) for toast value 1498303849 in pg_toast_10919630", where
>> previously we've only seen "unexpected chunk number 0 (expected 1)".
>
>> We are kind of at a loss, so any suggestions on what we could try are
>> welcome.
>
> The basic thrust of these messages is "I'm reading what should be
> sequentially numbered data chunks for this toast OID, and the sequence
> numbers are wrong".  Both of these instances could be explained by
> duplicate toast rows (or duplicate index entries pointing at one row),
> though of course that would just move to the next question of how it
> got that way.  Anyway, you could move the investigation along with
> some manual checking into what's in that toast table.  For instance
>
> select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
>   from pg_toast.pg_toast_10919630
>   where chunk_id = 1498303849
>   order by 1,2;
>
> might be informative.  If you do see what seem to be duplicate
> chunk_seq values, checking whether they're still there in a
> seqscan would be good.

Here's a statement which currently gives an unexpected chunk error:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '189909908';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698936148 in 
pg_toast_10919630

And when I run the suggested query, I get:

  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698936148 
order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length 
  ------------+-----------+--------------+------------+------+--------
   1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
   1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
  (2 rows)

How would I go about checking if they are still in a seqscan?
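
Would that just be a matter of something like this, forcing the planner
away from the index for the session, or is there a better way?

  set enable_indexscan = off;
  set enable_bitmapscan = off;
  select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
    from pg_toast.pg_toast_10919630
    where chunk_id = 1698936148
    order by 1, 2;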

(Note: this is on PostgreSQL 9.3.22.)


  Best regards,

Adam

-- 
 "No more than that, but very powerful all theAdam Sjøgren
  same; simple things are good."a...@novozymes.com




Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2018-04-05 Thread Adam Sjøgren
Adam writes:

> Here's a statement which currently gives an unexpected chunk error:
>
>   efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '189909908';
>   ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698936148 
> in pg_toast_10919630
>
> And when I run the suggested query, I get:
>
>   efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
> length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 
> 1698936148 order by 1,2;
>     chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length 
>   ------------+-----------+--------------+------------+------+--------
>    1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
>    1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
>   (2 rows)

More examples:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '237764759';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698958350 in 
pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698958350 
order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length 
  ------------+-----------+--------------+------------+------+--------
   1698958350 |         0 | (54859821,2) | 1511487270 |    0 |   1448
  (1 row)

And:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '366275833';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698945095 in 
pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698945095 
order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length 
  ------------+-----------+--------------+------------+------+--------
   1698945095 |         0 | (53706565,3) | 1511426847 |    0 |   1996
   1698945095 |         1 | (53706565,6) | 1511426847 |    0 |    108
  (2 rows)

One more:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '189909908';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698936148 in 
pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698936148 
order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length 
  ------------+-----------+--------------+------------+------+--------
   1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
   1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
  (2 rows)

And here is one from another table:

  efamroot@kat efam=# SELECT * FROM efam.sequence WHERE id = '235887163';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698750544 in 
pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698750544 
order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length 
  ------------+-----------+--------------+------------+------+--------
   1698750544 |         0 | (39575142,3) | 1510704835 |    0 |   1996
   1698750544 |         1 | (39575142,4) | 1510704835 |    0 |    716
  (2 rows)

Let me know what other relevant info I can provide.


  Best regards,

Adam

-- 
 "No more than that, but very powerful all theAdam Sjøgren
  same; simple things are good."a...@novozymes.com




Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2018-04-05 Thread Adam Sjøgren
Tom writes:

>> And when I run the suggested query, I get:
>
>>   efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
>> length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 
>> 1698936148 order by 1,2;
>>     chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length 
>>   ------------+-----------+--------------+------------+------+--------
>>    1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
>>    1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
>>   (2 rows)

> Hmph.  So if you EXPLAIN that query, does it show it's doing it as an
> indexscan?  I'd expect so, but it's always good to make sure.

It does:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '189909908';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698936148 in 
pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698936148 
order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length 
  ------------+-----------+--------------+------------+------+--------
   1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
   1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
  (2 rows)

  efamroot@kat efam=# explain select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698936148 
order by 1,2;
                                                   QUERY PLAN
  ---------------------------------------------------------------------------------------------------------------
   Index Scan using pg_toast_10919630_index on pg_toast_10919630  (cost=0.57..2627179.25 rows=2135674 width=54)
     Index Cond: (chunk_id = 1698936148::oid)
  (2 rows)

> Assuming it does say that, then the other test I had in mind would
> involve "set enable_indexscan = 0", then repeat the EXPLAIN to make
> sure that you now get a seqscan plan (you might need to turn off
> enable_bitmapscan too), then do the query again and see whether the
> results are the same.

Ok (if I don't disable bitmapscan, I get Bitmap Heap Scans in the EXPLAIN, so):

  efamroot@kat efam=# set enable_indexscan = 0;
  SET
  efamroot@kat efam=# set enable_bitmapscan = 0;
  SET
  efamroot@kat efam=# explain select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698936148 
order by 1,2;
                                       QUERY PLAN
  ---------------------------------------------------------------------------------------
   Sort  (cost=96465280.57..96470619.75 rows=2135674 width=54)
     Sort Key: chunk_seq
     ->  Seq Scan on pg_toast_10919630  (cost=0.00..96240754.39 rows=2135674 width=54)
           Filter: (chunk_id = 1698936148::oid)
  (4 rows)

  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698936148 
order by 1,2;
  [... still waiting for the result, I will return with what it said
   when the server does ...]


  Best regards,

Adam

-- 
 "No more than that, but very powerful all theAdam Sjøgren
  same; simple things are good."a...@novozymes.com




Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2018-04-05 Thread Adam Sjøgren
Adam writes:

>   efamroot@kat efam=# explain select chunk_id, chunk_seq, ctid, xmin, xmax, 
> length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 
> 1698936148 order by 1,2;
>                                        QUERY PLAN
>   ---------------------------------------------------------------------------------------
>    Sort  (cost=96465280.57..96470619.75 rows=2135674 width=54)
>      Sort Key: chunk_seq
>      ->  Seq Scan on pg_toast_10919630  (cost=0.00..96240754.39 rows=2135674 width=54)
>            Filter: (chunk_id = 1698936148::oid)
>   (4 rows)
>
>   efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
> length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 
> 1698936148 order by 1,2;
>   [... still waiting for the result, I will return with what it said
>when the server does ...]

It did eventually finish, with the same result:

  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, 
length(chunk_data) from pg_toast.pg_toast_10919630 where chunk_id = 1698936148 
order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length 
  ------------+-----------+--------------+------------+------+--------
   1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
   1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
  (2 rows)


  Best regards,

Adam

-- 
 "No more than that, but very powerful all theAdam Sjøgren
  same; simple things are good."a...@novozymes.com




Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2018-04-10 Thread Adam Sjøgren
Pavan writes:

>> FWIW one of our support customers reported a very similar TOAST table
>> corruption issue last week which nearly caused an outage. After a lot of
>> analysis, I think I've now fully understood the reasons behind the
>> corruption, the underlying bug(s) and possible remedy. I am currently
>> working on writing a reproducible test case to demonstrate the problem and
>> writing the fix. More details on that soon.
>
> I've posted a reproducer and a proposed fix to -hackers [1]
>
> In the particular case that I investigated, a database crash/recovery was
> involved. But I think we should be able to create a scenario where OID
> wrap-around or a standby promotion triggers the problem.  I don't know if
> any of that was involved in the cases reported on this thread, but I've a
> strong suspicion that the underlying bug is probably the same.

In our case there was no crash+recovery, but we do have a high write
load (though the problem occurs quite seldom), so it sounds like it
fits.

> [1] 
> https://www.postgresql.org/message-id/CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-%3Dh4OLosXHf9w%40mail.gmail.com

Impressive investigation!

Just a couple of data points: your script to reproduce the problem does
so for all three versions of PostgreSQL we have in use: 9.3.22, 9.6.8
and 10.3.

And I specifically tested on our production machine which exhibits the
problem, running 9.3.22, and your script reproduced the problem there as
well.

I.e. ending with:

  ERROR:  unexpected chunk number 0 (expected 1) for toast value 16396 in 
pg_toast_16384
  REINDEX
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 16396 in 
pg_toast_16384
  VACUUM
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 16396 in 
pg_toast_16384


  Best regards,

    Adam

-- 
 "No more than that, but very powerful all theAdam Sjøgren
  same; simple things are good."a...@novozymes.com




Database logins taking longer and longer, showing up as "authentication" in ps(1)

2020-08-18 Thread Adam Sjøgren
  Hi,


We have a PostgreSQL 11.3¹ running on an Ubuntu 16.04 server, which
sometimes exhibits a behaviour I can't quite understand: simply logging
into the database starts to take minutes to complete.

We have 60 processes (workers) running on different machines accessing
the database, that all grab jobs from a queue and update rows in a table
after doing some calculations (which vary in time from <1s to perhaps a
minute, many of them fast).

Sometimes new database logins slow down, from usually taking <0.05s to
taking minutes. This is for psql as a normal user using Kerberos, for
psql as the postgres superuser, for the web-application logging into the
database, for everything.

What I see in the output of ps(1) is a lot of postgres processes in
state "authentication", when this problem occurs. It "feels" like the
logins are waiting for something - when e.g. psql finally logs in,
running queries seem normal.

The only "special" thing I can think of is that the workers use advisory
locks on rows in the table they are updating (as there can be multiple
jobs in flight at the same time, for the same row).
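
The advisory locking itself is essentially of this shape (simplified, and
with a made-up key - the real key is derived from the id of the row being
updated):

  SELECT pg_advisory_lock(12345);    -- blocks until no other worker holds this row
  -- ... do the calculation ...
  SELECT pg_advisory_unlock(12345);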

I can imagine that a pattern of jobs and workers could make the locking
cause those workers to wait and slow each other down, but I don't
understand how it can affect the time it takes for new database logins.

When this happens, if I stop the 60 workers, the login times are back to
normal within seconds, and all the postgres processes showing
"authentication" switch away from that state. If I start the workers
again, the login problem appears within a couple of minutes.

If I start a smaller number of workers, say 5, the problem doesn't
appear.

Could this be lock contention on a database table somehow affecting
logins?
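
Next time it happens I will try to grab a summary of pg_locks as well,
something like:

  SELECT locktype, mode, granted, count(*)
    FROM pg_locks
   GROUP BY 1, 2, 3
   ORDER BY count(*) DESC;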

I have tried searching for this problem, and found some older references
to things² that seem to have been fixed in the meantime. Is anybody else
seeing this?

The server is configured with max_connections 1200 and currently has
~400 connections. It has 1.5 TB of memory and 112 cores (including
hyperthreading) - the load is not high when I see this problem, between
20 and 30.

Any ideas and suggestions are welcome - if there is some relevant
information I can provide, I will be happy to try.


  Best regards,

Adam


¹ We should plan to upgrade to 11.9, yes.

² E.g. 
https://www.postgresql.org/message-id/CAJw8uJQGoobesbPCMbxj6Vb4nv9D-GgvZ%2B7pK%2Bfckbb4DqJEAg%40mail.gmail.com
   and 
https://www.postgresql.org/message-id/20140312232810.634.7997%40wrigleys.postgresql.org

-- 
 "We don't care about crashing our cars Adam Sjøgren
  Pedal to the metal and our heads among the stars"   a...@koldfront.dk





Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)

2020-08-18 Thread Adam Sjøgren
Laurenz writes:

> On Tue, 2020-08-18 at 14:53 +0200, Adam Sjøgren wrote:
>>
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>> 
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> I would suspect that the problem is with the authentication method.

Interesting, how are you suspecting the interaction to be playing out?

> How is authentication configured?

Primarily using gss (for the web-application, delegated credentials),
md5 (system users), and a few trust (for legacy access).
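
If the full picture is useful, I believe I can pull it out of
pg_hba_file_rules, something like:

  SELECT line_number, type, database, user_name, auth_method
    FROM pg_hba_file_rules;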

> I'd "strace" a connection that is hanging in "authenticating" and
> see what the process does.

I did that one of the previous times I have seen this problem, and saw
stuff like this:

  2020-06-10 11:22:14 strace: Process 101559 attached
  2020-06-10 11:22:14 futex(0x7f46df29e138, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ) = 0
...
  2020-06-10 11:24:09 futex(0x7f46df29e138, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ) = 0
  2020-06-10 11:24:09 futex(0x7f46df2a2e38, FUTEX_WAKE, 1) = 1
  2020-06-10 11:24:09 brk(0x55c197084000) = 0x55c197084000
  2020-06-10 11:24:09 brk(0x55c1970b) = 0x55c1970b
  2020-06-10 11:24:09 open("global/pg_filenode.map", O_RDONLY) = 6

(where the elided part is futex() calls).

The slow login is regardless of whether I login with Kerberos or as the
postgres superuser with psql.


  Best regards,

Adam

-- 
 "Sometimes I feel like going home, walking in the  Adam Sjøgren
  boardroom                                          a...@koldfront.dk
  Sometimes I feel like I wanna go, walk into the
  ballroom"




Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)

2020-08-18 Thread Adam Sjøgren
Stephen writes:

> * Adam Sjøgren (a...@koldfront.dk) wrote:
>> 
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>> 
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> When in doubt, blame DNS.

I'd love to! However I don't see any DNS outages on our local network
correlating with whether I run 60 workers or 5 workers.

> Alternatively, in your case, the issue might be the KDC taking forever
> to issue a ticket for the service.

If that was the cause, logging in as the 'postgres' superuser (not using
Kerberos) locally on the server should be fast regardless, right?

> (though you might check if you have log_hostnames on..).

It's off:

  $ grep hostname /etc/postgresql/11/main/postgresql.conf 
  #log_hostname = off


Thanks for replying!


  Best regards,

Adam

-- 
 "Probabilistic algorithms don't appeal to me. (ThisAdam Sjøgren
  is a question of aesthetics, not practicality.) So   a...@koldfront.dk
  later, I figured out how to remove the probability
  and turn it into a deterministic algorithm."




Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)

2020-08-19 Thread Adam Sjøgren
Tom writes:

> Stephen Frost  writes:
>> * Adam Sjøgren (a...@koldfront.dk) wrote:
>>> Sometimes new database logins slow down, from usually taking <0.05s to
>>> taking minutes. This is for psql as a normal user using Kerberos, for
>>> psql as the postgres superuser, for the web-application logging into the
>>> database, for everything.
>
> I think looking into pg_stat_activity and pg_locks might be interesting,
> although it's possible the delay is before the new session has made
> any entries there.
>
> Also, obtaining stack traces from a few of the stuck processes would
> likely be quite informative.

I will try to look at pg_stat_activity and pg_locks the next time - it
just happened now and I focused on getting some stacktraces -
unfortunately I do not have the -dbg package installed, so I don't know
how useful they are.
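
Next time I will also try to capture the wait events while it is
happening; I assume a query along these lines would show them:

  SELECT pid, state, wait_event_type, wait_event, query
    FROM pg_stat_activity
   WHERE wait_event IS NOT NULL
   ORDER BY backend_start;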

Here's the first one:

  $ ps faux | grep postgres | grep authentication
  postgres  16697  0.0  0.0 34686088 8352 ?   Ss   14:45   0:00  \_ 
postgres: 11/main: workqueue XXdb 127.0.0.1(42534) authentication
  postgres  16701  0.0  0.0 34686088 6476 ?   Ss   14:45   0:00  \_ 
postgres: 11/main: anonymous XXdb 127.0.0.1(42536) authentication
  postgres  16702  0.0  0.0 34686088 8348 ?   Ss   14:45   0:00  \_ 
postgres: 11/main: svc_user XXdb 127.0.0.1(42540) authentication
  postgres  16704  0.0  0.0 34686040 5376 ?   Ss   14:45   0:00  \_ 
postgres: 11/main: postgres postgres [local] authentication
  postgres  16705  0.0  0.0 34686088 8352 ?   Ss   14:45   0:00  \_ 
postgres: 11/main: svc_user XXdb 127.0.0.1(42542) authentication
  postgres  16708  0.0  0.0 34686088 6476 ?   Ss   14:45   0:00  \_ 
postgres: 11/main: svc_user XXdb 127.0.0.1(42544) authentication
  $ sudo gdb -p  16708
  GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
[...]
  Attaching to process 16708
  Reading symbols from /usr/lib/postgresql/11/bin/postgres...(no debugging 
symbols found)...done.
[...]
  0x7f8270a039f3 in delete_module () at 
../sysdeps/unix/syscall-template.S:86
  86../sysdeps/unix/syscall-template.S: No such file or directory.
  (gdb) bt
  #0  0x7f8270a039f3 in delete_module () at 
../sysdeps/unix/syscall-template.S:86
  #1  0x55f60d3d0101 in WaitEventSetWait ()
  #2  0x55f60d2dccff in secure_read ()
  #3  0x55f60d2e9138 in ?? ()
  #4  0x55f60d2e9e75 in pq_getbyte ()
  #5  0x55f60d3f3100 in PostgresMain ()
  #6  0x55f60d102bf8 in ?? ()
  #7  0x55f60d37e492 in PostmasterMain ()
  #8  0x55f60d103fa5 in main ()
  (gdb) 

I think I was too slow on this one, and this is what it looks like if
the process is just waiting for a connection (as you can guess, I'm not
used to looking at backtraces).

Here is another one, which hopefully is caught at a better moment, and
more useful:

  postgres  19495  0.0  0.0 34686088 8412 ?   Ss   14:48   0:00  \_ 
postgres: 11/main: workqueue XXdb 127.0.0.1(43390) authentication
  postgres  19497  0.0  0.0 34686088 6476 ?   Ss   14:48   0:00  \_ 
postgres: 11/main: XXX XXdb 127.0.0.1(43392) authentication
  postgres  19502  0.0  0.0 34686088 8412 ?   Ss   14:48   0:00  \_ 
postgres: 11/main: XXX XXdb 127.0.0.1(43394) authentication
  $ sudo gdb -p  19495
  GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
[...]
  Attaching to process 19495
[...]
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  205   ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at 
sem_waitcommon.c:111
  #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, 
abstime=0x0) at sem_waitcommon.c:181
  #3  0x7f8272cf297a in __new_sem_wait (sem=) at 
sem_wait.c:29
  #4  0x55f60d36c252 in PGSemaphoreLock ()
  #5  0x55f60d3e35cc in LWLockAcquire ()
  #6  0x55f60d3d14ee in ProcArrayAdd ()
  #7  0x55f60d3e01a3 in InitProcessPhase2 ()
  #8  0x55f60d51c99b in InitPostgres ()
  #9  0x55f60d3f2baf in PostgresMain ()
  #10 0x55f60d102bf8 in ?? ()
  #11 0x55f60d37e492 in PostmasterMain ()
  #12 0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  205   in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  [...]
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  205   in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at 
sem_waitcommon.c:111
  #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, 
abstime=0x0) at sem_waitcommon.c:181
  #3  0x7f8272cf297a in __new_sem_wait (sem=) at 
sem_wait.c:29
  #4  0x55f60d36c252 in PGSemaphoreLock ()
  #5  0x55f60d3e35cc in LWLockAcquire ()
  #6  0x55f60d3d14ee in ProcArrayAdd ()
  #7  0x55f60d3e01a3 in InitProcessPhase2 ()
  #8  0x55f60d51c99b in InitPostgres ()
  #9  0x55f60d3f2baf in PostgresMain ()
  #10 0x55f60d102bf8 in ?? ()
  #11 0x55f60d37e492 in PostmasterMain ()
  #12 0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  205   in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at 
sem_waitcommon.c:111
  #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, 
abstime=0x0) at sem_waitcommon.c:181
  #3  0x7f8272cf297a in __new_sem_wait (sem=) at 
sem_wait.c:29
  #4  0x55f60d36c252 in PGSemaphoreLock ()
  #5  0x55f60d3e35cc in LWLockAcquire ()
  #6  0x55f60d3d14ee in ProcArrayAdd ()
  #7  0x55f60d3e01a3 in InitProcessPhase2 ()
  #8  0x55f60d51c99b in InitPostgres ()
  #9  0x55f60d3f2baf in PostgresMain ()
  #10 0x55f60d102bf8 in ?? ()
  #11 0x55f60d37e492 in PostmasterMain ()
  #12 0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  205   in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at 
../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at 
sem_waitcommon.c:111
  #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, 
abstime=0x0) at sem_waitcommon.c:181
  #3  0x7f8272cf297a in __new_sem_wait (sem=) at 
sem_wait.c:29
  #4  0x55f60d36c252 in PGSemaphoreLock ()
  #5  0x55f60d3e35cc in LWLockAcquire ()
  #6  0x55f60d3d14ee in ProcArrayAdd ()
  #7  0x55f60d3e01a3 in InitProcessPhase2 ()
  #8  0x55f60d51c99b in InitPostgres ()
  #9  0x55f60d3f2baf in PostgresMain ()
  #10 0x55f60d102bf8 in ?? ()
  #11 0x55f60d37e492 in PostmasterMain ()
  #12 0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8270a039f3 in delete_module () at 
../sysdeps/unix/syscall-template.S:86
  86../sysdeps/unix/syscall-template.S: No such file or directory.
  (gdb) bt
  #0  0x7f8270a039f3 in delete_module () at 
../sysdeps/unix/syscall-template.S:86
  #1  0x55f60d3d0101 in WaitEventSetWait ()
  #2  0x55f60d2dccff in secure_read ()
  #3  0x55f60d2e9138 in ?? ()
  #4  0x55f60d2e9e75 in pq_getbyte ()
  #5  0x55f60d3f3100 in PostgresMain ()
  #6  0x55f60d102bf8 in ?? ()
  #7  0x55f60d37e492 in PostmasterMain ()
  #8  0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8270a039f3 in delete_module () at 
../sysdeps/unix/syscall-template.S:86
  86in ../sysdeps/unix/syscall-template.S
  (gdb) detach
  Detaching from program: /usr/lib/postgresql/11/bin/postgres, process 79572
  (gdb) quit

I hope this is useful. Let me know if there are any parameters I should
try adjusting, or some other information I should try to extract. Thanks!


  Best regards,

Adam

-- 
 "Why do you ask so much?"  Adam Sjøgren
 "I'm a kid. It's what we do." a...@koldfront.dk




Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)

2020-08-20 Thread Adam Sjøgren
Tom writes:

> Frequently, stack traces taken without debug symbols are pretty useless,
> but you seem to have got lucky.

For once, hah!

> It seems clear that the unexpected wait is here:
>
>>   (gdb) bt
>>   #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, 
>> abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at 
>> ../sysdeps/unix/sysv/linux/futex-internal.h:205
>>   #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at 
>> sem_waitcommon.c:111
>>   #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, 
>> abstime=0x0) at sem_waitcommon.c:181
>>   #3  0x7f8272cf297a in __new_sem_wait (sem=) at 
>> sem_wait.c:29
>>   #4  0x55f60d36c252 in PGSemaphoreLock ()
>>   #5  0x55f60d3e35cc in LWLockAcquire ()
>>   #6  0x55f60d3d14ee in ProcArrayAdd ()
>>   #7  0x55f60d3e01a3 in InitProcessPhase2 ()
>>   #8  0x55f60d51c99b in InitPostgres ()
>>   #9  0x55f60d3f2baf in PostgresMain ()
>>   #10 0x55f60d102bf8 in ?? ()
>>   #11 0x55f60d37e492 in PostmasterMain ()
>>   #12 0x55f60d103fa5 in main ()
>
> which implies that the problem is unexpectedly high contention for the
> ProcArrayLock.  It's unsurprising that incoming processes would need
> that lock, since they need to add themselves to the ProcArray.  What's
> not so apparent is why there is so much activity as to block them
> from doing so for a significant amount of time.  We can guess that
> the "worker processes" you mentioned are ganging up on that lock
> because they need to compute query snapshots, but that still doesn't
> seem like enough of an explanation.
>
> Can you show us exactly what the "worker processes" are doing while
> this is going on?

One thing I should have mentioned, but forgot, is that the database is
configured to do logical replication to another machine - could that
have an effect on the lock contention?

A colleague pointed out that in the pg_locks output, the replication
processes are waiting on the ProcArray lock:

 · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

I can try turning off replication the next time I see the problem.
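
(Assuming it is enough to pause things on the receiving side with the
built-in publication/subscription machinery, I expect that would just be
something like this, with a made-up subscription name:

  ALTER SUBSCRIPTION repl_sub DISABLE;

and a corresponding ENABLE afterwards.)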

The workers basically loop on (a rough SQL sketch follows the list):

 * SELECT a job from a job-table, UPDATE it
 * Take an advisory lock on the relevant row in the calculations table
 * Do some work, at least a couple of SELECTs from various tables
 * UPDATE or INSERT a row into the calculations table, release the lock
 * INSERT a row in an exitstatus table
 * DELETE the job
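
In SQL terms the loop is very roughly of this shape (all names and
literals made up for illustration):

  BEGIN;
  SELECT id, calc_id FROM jobs LIMIT 1;                      -- grab a job
  UPDATE jobs SET status = 'taken' WHERE id = 42;
  COMMIT;

  SELECT pg_advisory_lock(4242);                             -- lock the calculations row
  -- ... a couple of SELECTs from various tables, do the work ...
  UPDATE calculations SET result = '...' WHERE id = 4242;    -- or an INSERT
  SELECT pg_advisory_unlock(4242);

  INSERT INTO exitstatus (job_id, status) VALUES (42, 'ok');
  DELETE FROM jobs WHERE id = 42;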

The calculations can be sub-second to several minutes, the majority
mostly in the shorter end. As this is happening there are millions of
jobs in the job-table.

When the problem occurs, the worker logfiles indicate that they are
moving along, still working.

To give a sense of scale, here are the number of transactions per second
plotted:

 · 
https://koldfront.dk/misc/postgresql/postgres_transactions_ALL-pinpoint_1597799383-1597907383.png

(I think the hole in the graph is when Munin timed out getting the
information, and the number of transactions dropped as I stopped most of
the workers.)


  Best regards,

Adam

-- 
 "Examination and mastering of a new highly Adam Sjøgren
  intellectual equipment was a hard labour."   a...@koldfront.dk




Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)

2020-09-30 Thread Adam Sjøgren
Tom writes:

> =?utf-8?Q?Adam_Sj=C3=B8gren?=  writes:
>> Tom writes:
>>> ... which implies that the problem is unexpectedly high contention for the
>>> ProcArrayLock.
>
>> One thing I should have mentioned, but forgot, is that the database is
>> configured to do logical replication to another machine - could that
>> have an effect on the lock contention?
>> A colleague pointed out that in the pg_locks output, the replication
>> processes are waiting on the ProcArray lock:
>>  · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz
>
> Yeah, that is *mighty* interesting.  For the archives' sake, I'll point
> out that this shows 16 walsender processes, of which 6 are in
> WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10
> are waiting for the ProcArrayLock.

Sorry for not returning on this before now.

I have seen the slow login problem - ProcArrayLock contention - happen
with replication turned off, so I think that can be ruled out as the
cause.

I just observed something interesting right now, however.

Just to recap the scene: the basic setup is that we have a table with a
queue of jobs to be processed, and we have a number of programs (say 10)
on 6 other machines grabbing those jobs, doing calculations and
reporting back.

The number of jobs at any given time can fluctuate from none to
millions. Typically millions of jobs take some days to a week to get
through.

Now, I happened to have a psql prompt open when the problem of new
logins being slow appeared - and I ran a query that counted the number
of jobs by job-type.
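
The query is roughly of this shape (actual table and column names
differ):

  SELECT job_type, count(*) FROM jobs GROUP BY job_type;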

Before the slow login problem, that query took around 70-80 ms, but when
the slow login appeared, it took 20+ seconds to run the query.

EXPLAIN ANALYZE showed that it was running a parallel query.

So I tried turning parallel queries off - with SET
max_parallel_workers_per_gather = 0; - and now the query took around 150
ms consistently, despite logins still being (variably) slow.

So that confirms the ProcArrayLock contention, I think.

My next thought is: I have 6*10 programs making connections and doing
parallel queries over the job queue, to get new jobs. If the jobs happen
to be very "easy", those programs will be doing a lot of queries for
jobs. And when the queue is large enough (I'm guessing), those queries
will be parallel, so they will be creating new processes that need to be
added and removed from the ProcArray.

So am I shooting myself in the foot here, by creating lock contention
when trying to get jobs quickly from a largeish table?

I think my next step will be to modify the programs that grab the jobs
to avoid parallel queries when looking up the next job to run.
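
Presumably that is just a matter of something like this, either per
session or for the worker role as a whole:

  SET max_parallel_workers_per_gather = 0;
  -- or, once, for the role the workers connect as (here called 'workqueue'):
  ALTER ROLE workqueue SET max_parallel_workers_per_gather = 0;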

You anticipated this, it just took me until now to get a little closer:

> It's also evident that there's some parallel query activity going on,
> as some sessions are in BgWorkerShutdown waits; so parallel worker
> entry/exit might be contributing to the overall load on ProcArrayLock.


  Best regards,

Adam

-- 
 "That's one of the remarkable things about life.   Adam Sjøgren
  It's never so bad that it can't get worse."  a...@koldfront.dk




Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)

2020-09-30 Thread Adam Sjøgren
Adrian writes:

> I don't have an answer. Not even sure if this is relevant to the
> problem, but how are the jobs getting into the queue?

Plain INSERTs - often a lot at the same time. I inserted 400K jobs to
clean up after a bug earlier today, for instance.

What were you suspecting?


  Best regards,

Adam

-- 
 "Don't think twice     Adam Sjøgren
  It's all right"  a...@koldfront.dk




Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2018-01-16 Thread Adam Sjøgren
We are seeing these "ERROR:  unexpected chunk number 0 (expected 1) for
toast value 1498303849 in pg_toast_10919630" in increasing numbers again¹.

An observation is that they seem to only happen for tsvector fields.

Here is an example sequence of queries for a record (we have more than a
handful of these currently), which exhibits the problem.

First we get two other fields, 'sequence' is large enough to be toast'ed:

  2018-01-16 08:51:17.362 efam=# select id,sequence from efam.sequence where id 
= 164504550;
  Time: 1.150 ms

No problem.

Then we also fetch the tsvector field:

  2018-01-16 08:51:27.773 efam=# select id,sequence,fts from efam.sequence 
where id = 164504550;
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 in 
pg_toast_10919630
  Time: 0.912 ms

And we get the error.

Getting the id and the tsvector:

  2018-01-16 08:51:34.174 efam=# select id,fts from efam.sequence where id = 
164504550;
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 in 
pg_toast_10919630
  Time: 6.138 ms

gives the error.

Just getting the tsvector:

  2018-01-16 08:51:40.066 efam=# select fts from efam.sequence where id = 
164504550;
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 in 
pg_toast_10919630
  Time: 1.805 ms

Gives the error.

Field definitions:

   id        | integer  | 
   sequence  | text     | 
   fts       | tsvector | 

Maybe the observation that this only happens (for us) on tsvector rings
a bell for someone?


  Best regards,

Adam


¹ As reported back in June, 2017, starting here:
  https://www.postgresql.org/message-id/7pefuv53dl.fsf%40novozymes.com
  (I have to admit I never got around to trying to revert the commit
  Alvaro Herrera suggested we try without
  
(https://www.postgresql.org/message-id/20170611033840.hruqadsk47qcdrqb%40alvherre.pgsql))

-- 
 "No more than that, but very powerful all theAdam Sjøgren
  same; simple things are good."a...@novozymes.com




Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2018-01-16 Thread Adam Sjøgren
  Hi Jorge,


This sounds very interesting - we are running PostgreSQL 9.3.20.

Did you ever find out exactly what the change that solved the problem
between 9.4.8 and 9.4.11 was?


  Best regards,

Adam


Jorge writes:

> Hi Adam , I've been seeing this same  kind of Error in my clusters for a 
> while .
>
> Is this running on postgresql 9.4.8? Because mine was fixed upgrading to 
> 9.4.11 , a bug indeed .
>
>
>
> Kind regards
>
>
>
> Jorge Daniel Fernandez
>
>
> 
> From: Adam Sjøgren 
> Sent: Tuesday, January 16, 2018 7:18 AM
> To: pgsql-gene...@postgresql.org
> Subject: Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 
> 76753264 in pg_toast_10920100
>
> We are seeing these "ERROR:  unexpected chunk number 0 (expected 1) for
> toast value 1498303849 in pg_toast_10919630" in increasing numbers again¹.
>
> An observation is that they seem to only happen for tsvector fields.
>
> Here is an example sequence of queries for a record (we have more than a
> handful of these currently), which exhibits the problem.
>
> First we get two other fields, 'sequence' is large enough to be toast'ed:
>
>   2018-01-16 08:51:17.362 efam=# select id,sequence from efam.sequence where 
> id = 164504550;
>   Time: 1.150 ms
>
> No problem.
>
> Then we also fetch the tsvector field:
>
>   2018-01-16 08:51:27.773 efam=# select id,sequence,fts from efam.sequence 
> where id = 164504550;
>   ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 
> in pg_toast_10919630
>   Time: 0.912 ms
>
> And we get the error.
>
> Getting the id and the tsvector:
>
>   2018-01-16 08:51:34.174 efam=# select id,fts from efam.sequence where id = 
> 164504550;
>   ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 
> in pg_toast_10919630
>   Time: 6.138 ms
>
> gives the error.
>
> Just getting the tsvector:
>
>   2018-01-16 08:51:40.066 efam=# select fts from efam.sequence where id = 
> 164504550;
>   ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 
> in pg_toast_10919630
>   Time: 1.805 ms
>
> Gives the error.
>
> Field definitions:
>
>    id        | integer  |
>    sequence  | text     |
>    fts       | tsvector |
>
> Maybe the observation that this only happens (for us) on tsvector rings
> a bell for someone?
>
>
>   Best regards,
>
> Adam
>
>
> ¹ As reported back in June, 2017, starting here:
>   https://www.postgresql.org/message-id/7pefuv53dl.fsf%40novozymes.com
>   (I have to admit I never got around to trying to revert the commit
>   Alvaro Herrera suggested we try without
>   
> (https://www.postgresql.org/message-id/20170611033840.hruqadsk47qcdrqb%40alvherre.pgsql))
>
> --
>  "No more than that, but very powerful all theAdam Sjøgren
>   same; simple things are good."a...@novozymes.com
>
>

-- 
 "No more than that, but very powerful all theAdam Sjøgren
  same; simple things are good."a...@novozymes.com




Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

2018-01-16 Thread Adam Sjøgren
Michael writes:

> On Tue, Jan 16, 2018 at 07:05:19PM +0100, Adam Sjøgren wrote:

>> This sounds very interesting - we are running PostgreSQL 9.3.20.

> Which means that we may be looking at a new bug, 9.3.20 is the latest in
> the 9.3 set as of today.

Yes; unfortunately we have failed to reproduce it outside production.

The fact that the tsvector field is always involved when we see the
error might be of interest (the tsvector field is the most updated in
our database, however).

Just for completeness, the tsvector field has a GIN index on it:

"sequence_fts_idx" gin (fts) WITH (fastupdate=off)

and it is updated by a BEFORE INSERT OR UPDATE trigger.
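
For reference, the definitions are along these lines (simplified; the
trigger shown uses the built-in tsvector_update_trigger just to
illustrate the shape - it is not our exact DDL):

  CREATE INDEX sequence_fts_idx ON efam.sequence
    USING gin (fts) WITH (fastupdate = off);

  CREATE TRIGGER sequence_fts_update
    BEFORE INSERT OR UPDATE ON efam.sequence
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(fts, 'pg_catalog.english', sequence);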

A new observation: previously we were able to get the "unexpected chunk
number" error to go away by simply updating the tsvector field of the
offending record, but we now have a record where we get "ERROR: tuple
concurrently updated" when we try overwriting the field.

On another record exhibiting the "unexpected chunk number" error, we
could overwrite the fts field, as we can on rows not affected by the
"unexpected chunk number" error. So it seems the two errors might
somehow be related.

We tried stopping all activity on the database, and still got the
"ERROR: tuple concurrently updated" on the row with "unexpected chunk
number".

Also, the error we are getting is now: "unexpected chunk number 2
(expected 3) for toast value 1498303849 in pg_toast_10919630", where
previously we've only seen "unexpected chunk number 0 (expected 1)".

We are kind of at a loss, so any suggestions on what we could try are
welcome.

>> Did you ever find out exactly what the change that solved the problem
>> between 9.4.8 and 9.4.11 was?

> In this case, I think that you are looking for this thread:
> https://www.postgresql.org/message-id/20160826072658.15676.7...@wrigleys.postgresql.org

> And this commit:
> commit: a694435641faf26a9a4c210d20576ae836e86c48
> author: Tom Lane 
> date: Sat, 3 Sep 2016 13:28:53 -0400
> Fix corrupt GIN_SEGMENT_ADDITEMS WAL records on big-endian hardware.
>
> Both involved 9.4.8.

We run on x86_64 hardware, so I guess this wouldn't affect us?


  Best regards,

Adam

-- 
 "No more than that, but very powerful all theAdam Sjøgren
  same; simple things are good."a...@novozymes.com