Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Tom writes:

> a...@novozymes.com (Adam Sjøgren) writes:
>> Also, the error we are getting is now: "unexpected chunk number 2
>> (expected 3) for toast value 1498303849 in pg_toast_10919630", where
>> previously we've only seen "unexpected chunk number 0 (expected 1)".
>
>> We are kind of at a loss, so any suggestions on what we could try are
>> welcome.
>
> The basic thrust of these messages is "I'm reading what should be
> sequentially numbered data chunks for this toast OID, and the sequence
> numbers are wrong".  Both of these instances could be explained by
> duplicate toast rows (or duplicate index entries pointing at one row),
> though of course that would just move to the next question of how it
> got that way.  Anyway, you could move the investigation along with
> some manual checking into what's in that toast table.  For instance
>
>   select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
>   from pg_toast.pg_toast_10919630
>   where chunk_id = 1498303849
>   order by 1,2;
>
> might be informative.  If you do see what seem to be duplicate
> chunk_seq values, checking whether they're still there in a
> seqscan would be good.

Here's a statement which currently gives an unexpected chunk error:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '189909908';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698936148 in pg_toast_10919630

And when I run the suggested query, I get:

  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698936148
                      order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length
  ------------+-----------+--------------+------------+------+--------
   1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
   1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
  (2 rows)

How would I go about checking whether they are still there in a seqscan?

(Note: this is on PostgreSQL 9.3.22.)


  Best regards,

    Adam

--
 "No more than that, but very powerful all the           Adam Sjøgren
  same; simple things are good."                   a...@novozymes.com
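A broader version of the duplicate-chunk check Tom suggests - a sketch
only, and necessarily a full scan of the (large) toast table - would be
to look for any (chunk_id, chunk_seq) pair that occurs more than once:

  select chunk_id, chunk_seq, count(*)
  from pg_toast.pg_toast_10919630
  group by chunk_id, chunk_seq
  having count(*) > 1;

An empty result would suggest the heap itself holds no duplicate
chunks, pointing the suspicion towards the toast index instead.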
Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Adam writes:

> Here's a statement which currently gives an unexpected chunk error:
>
>   efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '189909908';
>   ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698936148 in pg_toast_10919630
>
> And when I run the suggested query, I get:
>
>   efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
>                       from pg_toast.pg_toast_10919630
>                       where chunk_id = 1698936148
>                       order by 1,2;
>     chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length
>   ------------+-----------+--------------+------------+------+--------
>    1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
>    1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
>   (2 rows)

More examples:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '237764759';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698958350 in pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698958350
                      order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length
  ------------+-----------+--------------+------------+------+--------
   1698958350 |         0 | (54859821,2) | 1511487270 |    0 |   1448
  (1 row)

And:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '366275833';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698945095 in pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698945095
                      order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length
  ------------+-----------+--------------+------------+------+--------
   1698945095 |         0 | (53706565,3) | 1511426847 |    0 |   1996
   1698945095 |         1 | (53706565,6) | 1511426847 |    0 |    108
  (2 rows)

One more:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '189909908';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698936148 in pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698936148
                      order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length
  ------------+-----------+--------------+------------+------+--------
   1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
   1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
  (2 rows)

And here is one from another table:

  efamroot@kat efam=# SELECT * FROM efam.sequence WHERE id = '235887163';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698750544 in pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698750544
                      order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length
  ------------+-----------+--------------+------------+------+--------
   1698750544 |         0 | (39575142,3) | 1510704835 |    0 |   1996
   1698750544 |         1 | (39575142,4) | 1510704835 |    0 |    716
  (2 rows)

Let me know what other relevant info I can provide.


  Best regards,

    Adam

--
 "No more than that, but very powerful all the           Adam Sjøgren
  same; simple things are good."                   a...@novozymes.com
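One way to enumerate every affected row, rather than stumbling on them
one id at a time, is a small PL/pgSQL loop that forces detoasting of
each row and reports the ones that raise an error - a sketch only,
assuming the efam.sequence definition (id, sequence, fts) shown
elsewhere in the thread, and it will read the entire table:

  DO $$
  DECLARE
    r record;
  BEGIN
    FOR r IN SELECT id FROM efam.sequence LOOP
      BEGIN
        -- force detoasting of the wide columns; a damaged row raises an error
        PERFORM length(sequence), length(fts::text)
        FROM efam.sequence
        WHERE id = r.id;
      EXCEPTION WHEN OTHERS THEN
        RAISE NOTICE 'id %: %', r.id, SQLERRM;
      END;
    END LOOP;
  END;
  $$;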
Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Tom writes:

>> And when I run the suggested query, I get:
>
>>   efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
>>                       from pg_toast.pg_toast_10919630
>>                       where chunk_id = 1698936148
>>                       order by 1,2;
>>     chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length
>>   ------------+-----------+--------------+------------+------+--------
>>    1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
>>    1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
>>   (2 rows)
>
> Hmph.  So if you EXPLAIN that query, does it show it's doing it as an
> indexscan?  I'd expect so, but it's always good to make sure.

It does:

  efamroot@kat efam=# SELECT * FROM efam.sendreference WHERE id = '189909908';
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1698936148 in pg_toast_10919630
  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698936148
                      order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length
  ------------+-----------+--------------+------------+------+--------
   1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
   1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
  (2 rows)

  efamroot@kat efam=# explain select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698936148
                      order by 1,2;
                                                   QUERY PLAN
  ----------------------------------------------------------------------------------------------------------------
   Index Scan using pg_toast_10919630_index on pg_toast_10919630  (cost=0.57..2627179.25 rows=2135674 width=54)
     Index Cond: (chunk_id = 1698936148::oid)
  (2 rows)

> Assuming it does say that, then the other test I had in mind would
> involve "set enable_indexscan = 0", then repeat the EXPLAIN to make
> sure that you now get a seqscan plan (you might need to turn off
> enable_bitmapscan too), then do the query again and see whether the
> results are the same.

Ok (if I don't disable bitmapscan, I get Bitmap Heap Scans in the
EXPLAIN, so):

  efamroot@kat efam=# set enable_indexscan = 0;
  SET
  efamroot@kat efam=# set enable_bitmapscan = 0;
  SET
  efamroot@kat efam=# explain select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698936148
                      order by 1,2;
                                          QUERY PLAN
  -----------------------------------------------------------------------------------------
   Sort  (cost=96465280.57..96470619.75 rows=2135674 width=54)
     Sort Key: chunk_seq
     ->  Seq Scan on pg_toast_10919630  (cost=0.00..96240754.39 rows=2135674 width=54)
           Filter: (chunk_id = 1698936148::oid)
  (4 rows)

  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698936148
                      order by 1,2;
  [... still waiting for the result, I will return with what it said
       when the server does ...]


  Best regards,

    Adam

--
 "No more than that, but very powerful all the           Adam Sjøgren
  same; simple things are good."                   a...@novozymes.com
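Once the comparison is done, the planner settings can be put back for
the session (a sketch; these only affect the current session):

  RESET enable_indexscan;
  RESET enable_bitmapscan;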
Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Adam writes:

>   efamroot@kat efam=# explain select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
>                       from pg_toast.pg_toast_10919630
>                       where chunk_id = 1698936148
>                       order by 1,2;
>                                           QUERY PLAN
>   -----------------------------------------------------------------------------------------
>    Sort  (cost=96465280.57..96470619.75 rows=2135674 width=54)
>      Sort Key: chunk_seq
>      ->  Seq Scan on pg_toast_10919630  (cost=0.00..96240754.39 rows=2135674 width=54)
>            Filter: (chunk_id = 1698936148::oid)
>   (4 rows)
>
>   efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
>                       from pg_toast.pg_toast_10919630
>                       where chunk_id = 1698936148
>                       order by 1,2;
>   [... still waiting for the result, I will return with what it said
>        when the server does ...]

It did eventually finish, with the same result:

  efamroot@kat efam=# select chunk_id, chunk_seq, ctid, xmin, xmax, length(chunk_data)
                      from pg_toast.pg_toast_10919630
                      where chunk_id = 1698936148
                      order by 1,2;
    chunk_id  | chunk_seq |     ctid     |    xmin    | xmax | length
  ------------+-----------+--------------+------------+------+--------
   1698936148 |         0 | (52888694,2) | 1511390221 |    0 |   1996
   1698936148 |         1 | (52888694,4) | 1511390221 |    0 |   1148
  (2 rows)


  Best regards,

    Adam

--
 "No more than that, but very powerful all the           Adam Sjøgren
  same; simple things are good."                   a...@novozymes.com
Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Pavan writes:

>> FWIW one of our support customers reported a very similar TOAST table
>> corruption issue last week which nearly caused an outage. After a lot of
>> analysis, I think I've now fully understood the reasons behind the
>> corruption, the underlying bug(s) and possible remedy. I am currently
>> working on writing a reproducible test case to demonstrate the problem and
>> writing the fix. More details on that soon.
>
> I've posted a reproducer and a proposed fix to -hackers [1]
>
> In the particular case that I investigated, a database crash/recovery was
> involved. But I think we should be able to create a scenario where OID
> wrap-around or a standby promotion triggers the problem. I don't know if
> any of that was involved in the cases reported on this thread, but I've a
> strong suspicion that the underlying bug is probably the same.

In our case there was no crash+recovery, but we do have a high write
load (and the problem occurring quite seldom), so it sounds like it
fits.

> [1] https://www.postgresql.org/message-id/CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-%3Dh4OLosXHf9w%40mail.gmail.com

Impressive investigation!

Just a couple of data points: your script to reproduce the problem does
so for all three versions of PostgreSQL we have in use: 9.3.22, 9.6.8
and 10.3.

And I specifically tested on our production machine which exhibits the
problem, running 9.3.22, and your script reproduced the problem there
as well. I.e. ending with:

  ERROR:  unexpected chunk number 0 (expected 1) for toast value 16396 in pg_toast_16384
  REINDEX
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 16396 in pg_toast_16384
  VACUUM
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 16396 in pg_toast_16384


  Best regards,

    Adam

--
 "No more than that, but very powerful all the           Adam Sjøgren
  same; simple things are good."                   a...@novozymes.com
Database logins taking longer and longer, showing up as "authentication" in ps(1)
Hi,

We have a PostgreSQL 11.3¹ running on an Ubuntu 16.04 server, which
sometimes exhibits a behaviour I can't quite understand: simply logging
into the database starts to take minutes to complete.

We have 60 processes (workers) running on different machines accessing
the database, that all grab jobs from a queue and update rows in a
table after doing some calculations (which vary in time from <1s to
perhaps a minute, many of them fast).

Sometimes new database logins slow down, from usually taking <0.05s to
taking minutes. This is for psql as a normal user using Kerberos, for
psql as the postgres superuser, for the web-application logging into
the database, for everything.

What I see in the output of ps(1) is a lot of postgres processes in
state "authentication", when this problem occurs. It "feels" like the
logins are waiting for something - when e.g. psql finally logs in,
running queries seem normal.

The only "special" thing I can think of is that the workers use
advisory locks on rows in the table they are updating (as there can be
multiple jobs in flight at the same time, for the same row). I can
imagine that a pattern of jobs and workers could make the locking cause
those workers to wait and slow each other down, but I don't understand
how it can affect the time it takes for new database logins.

When this happens, if I stop the 60 workers, the login times are back
to normal within seconds, and all the postgres processes showing
"authentication" switch away from that state. If I start the workers
again, the login problem appears within a couple of minutes. If I start
a smaller number of workers, say 5, the problem doesn't appear.

Could this be lock contention on a database table somehow affecting
logins?

I have tried searching for this problem, and found some older
references to things² that seem to have been fixed in the meantime. Is
anybody else seeing this?

The server is configured with max_connections 1200 and currently has
~400 connections. It has 1.5 TB of memory and 112 cores (including
hyperthreading) - the load is not high when I see this problem, between
20 and 30.

Any ideas and suggestions are welcome - if there is some relevant
information I can provide, I will be happy to try.


  Best regards,

    Adam


¹ We should plan to upgrade to 11.9, yes.

² E.g. https://www.postgresql.org/message-id/CAJw8uJQGoobesbPCMbxj6Vb4nv9D-GgvZ%2B7pK%2Bfckbb4DqJEAg%40mail.gmail.com
  and https://www.postgresql.org/message-id/20140312232810.634.7997%40wrigleys.postgresql.org

--
 "We don't care about crashing our cars                   Adam Sjøgren
  Pedal to the metal and our heads among the stars"    a...@koldfront.dk
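A quick way to see what the stuck and busy backends are waiting on,
once they are far enough along to show up in pg_stat_activity, is
something like the following (a sketch; the wait_event columns are
available in PostgreSQL 10 and later):

  SELECT pid, backend_type, state, wait_event_type, wait_event,
         now() - backend_start AS connected_for
  FROM pg_stat_activity
  WHERE wait_event IS NOT NULL
  ORDER BY wait_event_type, wait_event;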
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Laurenz writes:

> On Tue, 2020-08-18 at 14:53 +0200, Adam Sjøgren wrote:
>>
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>>
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> I would suspect that the problem is with the authentication method.

Interesting, how are you suspecting the interaction to be playing out?

> How is authentication configured?

Primarily using gss (for the web-application, delegated credentials),
md5 (system users), and a few trust (for legacy access).

> I'd "strace" a connection that is hanging in "authenticating" and
> see what the process does.

I did that one of the previous times I have seen this problem, and saw
stuff like this:

  2020-06-10 11:22:14 strace: Process 101559 attached
  2020-06-10 11:22:14 futex(0x7f46df29e138, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ) = 0
  ...
  2020-06-10 11:24:09 futex(0x7f46df29e138, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ) = 0
  2020-06-10 11:24:09 futex(0x7f46df2a2e38, FUTEX_WAKE, 1) = 1
  2020-06-10 11:24:09 brk(0x55c197084000) = 0x55c197084000
  2020-06-10 11:24:09 brk(0x55c1970b) = 0x55c1970b
  2020-06-10 11:24:09 open("global/pg_filenode.map", O_RDONLY) = 6

(where the elided part is futex() calls).

The slow login happens regardless of whether I login with Kerberos or
as the postgres superuser with psql.


  Best regards,

    Adam

--
 "Sometimes I feel like going home, walking in the        Adam Sjøgren
  boardroom                                            a...@koldfront.dk
  Sometimes I feel like I wanna go, walk into the ballroom"
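On 11 the active pg_hba.conf rules can also be inspected from SQL,
which makes it easy to show exactly which authentication methods are in
play (a sketch; reading pg_hba_file_rules requires superuser):

  SELECT line_number, type, database, user_name, auth_method
  FROM pg_hba_file_rules
  ORDER BY line_number;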
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Stephen writes:

> * Adam Sjøgren (a...@koldfront.dk) wrote:
>>
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>>
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> When in doubt, blame DNS.

I'd love to! However I don't see any DNS outages on our local network
correlating with whether I run 60 workers or 5 workers.

> Alternatively, in your case, the issue might be the KDC taking forever
> to issue a ticket for the service.

If that was the cause, logging in as the 'postgres' superuser (not
using Kerberos) locally on the server should be fast regardless, right?

> (though you might check if you have log_hostnames on..).

It's off:

  $ grep hostname /etc/postgresql/11/main/postgresql.conf
  #log_hostname = off

Thanks for replying!


  Best regards,

    Adam

--
 "Probabilistic algorithms don't appeal to me. (This      Adam Sjøgren
  is a question of aesthetics, not practicality.) So   a...@koldfront.dk
  later, I figured out how to remove the probability
  and turn it into a deterministic algorithm."
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Tom writes:

> Stephen Frost writes:
>> * Adam Sjøgren (a...@koldfront.dk) wrote:
>>> Sometimes new database logins slow down, from usually taking <0.05s to
>>> taking minutes. This is for psql as a normal user using Kerberos, for
>>> psql as the postgres superuser, for the web-application logging into the
>>> database, for everything.
>
> I think looking into pg_stat_activity and pg_locks might be interesting,
> although it's possible the delay is before the new session has made
> any entries there.
>
> Also, obtaining stack traces from a few of the stuck processes would
> likely be quite informative.

I will try to look at pg_stat_activity and pg_locks the next time - it
just happened now and I focused on getting some stacktraces -
unfortunately I do not have the -dbg package installed, so I don't know
how useful they are.

Here's the first one:

  $ ps faux | grep postgres | grep authentication
  postgres 16697  0.0  0.0 34686088 8352 ?  Ss  14:45  0:00  \_ postgres: 11/main: workqueue XXdb 127.0.0.1(42534) authentication
  postgres 16701  0.0  0.0 34686088 6476 ?  Ss  14:45  0:00  \_ postgres: 11/main: anonymous XXdb 127.0.0.1(42536) authentication
  postgres 16702  0.0  0.0 34686088 8348 ?  Ss  14:45  0:00  \_ postgres: 11/main: svc_user XXdb 127.0.0.1(42540) authentication
  postgres 16704  0.0  0.0 34686040 5376 ?  Ss  14:45  0:00  \_ postgres: 11/main: postgres postgres [local] authentication
  postgres 16705  0.0  0.0 34686088 8352 ?  Ss  14:45  0:00  \_ postgres: 11/main: svc_user XXdb 127.0.0.1(42542) authentication
  postgres 16708  0.0  0.0 34686088 6476 ?  Ss  14:45  0:00  \_ postgres: 11/main: svc_user XXdb 127.0.0.1(42544) authentication

  $ sudo gdb -p 16708
  GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
  [...]
  Attaching to process 16708
  Reading symbols from /usr/lib/postgresql/11/bin/postgres...(no debugging symbols found)...done.
  [...]
  0x7f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  86      ../sysdeps/unix/syscall-template.S: No such file or directory.
  (gdb) bt
  #0  0x7f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  #1  0x55f60d3d0101 in WaitEventSetWait ()
  #2  0x55f60d2dccff in secure_read ()
  #3  0x55f60d2e9138 in ?? ()
  #4  0x55f60d2e9e75 in pq_getbyte ()
  #5  0x55f60d3f3100 in PostgresMain ()
  #6  0x55f60d102bf8 in ?? ()
  #7  0x55f60d37e492 in PostmasterMain ()
  #8  0x55f60d103fa5 in main ()
  (gdb)

I think I was too slow on this one, and this is what it looks like if
the process is just waiting for a connection (as you can guess, I'm not
used to looking at backtraces).

Here is another one, which hopefully is caught at a better moment, and
more useful:

  postgres 19495  0.0  0.0 34686088 8412 ?  Ss  14:48  0:00  \_ postgres: 11/main: workqueue XXdb 127.0.0.1(43390) authentication
  postgres 19497  0.0  0.0 34686088 6476 ?  Ss  14:48  0:00  \_ postgres: 11/main: XXX XXdb 127.0.0.1(43392) authentication
  postgres 19502  0.0  0.0 34686088 8412 ?  Ss  14:48  0:00  \_ postgres: 11/main: XXX XXdb 127.0.0.1(43394) authentication

  $ sudo gdb -p 19495
  GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
  [...]
  Attaching to process 19495
  [...]
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205     ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x7f8272cf297a in __new_sem_wait (sem=) at sem_wait.c:29
  #4  0x55f60d36c252 in PGSemaphoreLock ()
  #5  0x55f60d3e35cc in LWLockAcquire ()
  #6  0x55f60d3d14ee in ProcArrayAdd ()
  #7  0x55f60d3e01a3 in InitProcessPhase2 ()
  #8  0x55f60d51c99b in InitPostgres ()
  #9  0x55f60d3f2baf in PostgresMain ()
  #10 0x55f60d102bf8 in ?? ()
  #11 0x55f60d37e492 in PostmasterMain ()
  #12 0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205     in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linu[...]
  [...]
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205     in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x7f8272cf297a in __new_sem_wait (sem=) at sem_wait.c:29
  #4  0x55f60d36c252 in PGSemaphoreLock ()
  #5  0x55f60d3e35cc in LWLockAcquire ()
  #6  0x55f60d3d14ee in ProcArrayAdd ()
  #7  0x55f60d3e01a3 in InitProcessPhase2 ()
  #8  0x55f60d51c99b in InitPostgres ()
  #9  0x55f60d3f2baf in PostgresMain ()
  #10 0x55f60d102bf8 in ?? ()
  #11 0x55f60d37e492 in PostmasterMain ()
  #12 0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205     in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x7f8272cf297a in __new_sem_wait (sem=) at sem_wait.c:29
  #4  0x55f60d36c252 in PGSemaphoreLock ()
  #5  0x55f60d3e35cc in LWLockAcquire ()
  #6  0x55f60d3d14ee in ProcArrayAdd ()
  #7  0x55f60d3e01a3 in InitProcessPhase2 ()
  #8  0x55f60d51c99b in InitPostgres ()
  #9  0x55f60d3f2baf in PostgresMain ()
  #10 0x55f60d102bf8 in ?? ()
  #11 0x55f60d37e492 in PostmasterMain ()
  #12 0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205     in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x7f8272cf297a in __new_sem_wait (sem=) at sem_wait.c:29
  #4  0x55f60d36c252 in PGSemaphoreLock ()
  #5  0x55f60d3e35cc in LWLockAcquire ()
  #6  0x55f60d3d14ee in ProcArrayAdd ()
  #7  0x55f60d3e01a3 in InitProcessPhase2 ()
  #8  0x55f60d51c99b in InitPostgres ()
  #9  0x55f60d3f2baf in PostgresMain ()
  #10 0x55f60d102bf8 in ?? ()
  #11 0x55f60d37e492 in PostmasterMain ()
  #12 0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  86      ../sysdeps/unix/syscall-template.S: No such file or directory.
  (gdb) bt
  #0  0x7f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  #1  0x55f60d3d0101 in WaitEventSetWait ()
  #2  0x55f60d2dccff in secure_read ()
  #3  0x55f60d2e9138 in ?? ()
  #4  0x55f60d2e9e75 in pq_getbyte ()
  #5  0x55f60d3f3100 in PostgresMain ()
  #6  0x55f60d102bf8 in ?? ()
  #7  0x55f60d37e492 in PostmasterMain ()
  #8  0x55f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x7f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  86      in ../sysdeps/unix/syscall-template.S
  (gdb) detach
  Detaching from program: /usr/lib/postgresql/11/bin/postgres, process 79572
  (gdb) quit

I hope this is useful. Let me know if there are any parameters I should
try adjusting, or some other information I should try to extract.

Thanks!


  Best regards,

    Adam

--
 "Why do you ask so much?"                                Adam Sjøgren
 "I'm a kid. It's what we do."                         a...@koldfront.dk
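For the pg_locks part of Tom's suggestion, a minimal query for
ungranted heavyweight locks could look like this (a sketch; note that
lightweight locks such as ProcArrayLock never show up in pg_locks, only
as LWLock wait events in pg_stat_activity):

  SELECT pid, locktype, mode, relation::regclass AS relation, granted
  FROM pg_locks
  WHERE NOT granted;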
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Tom writes:

> Frequently, stack traces taken without debug symbols are pretty useless,
> but you seem to have got lucky.

For once, hah!

> It seems clear that the unexpected wait is here:
>
>> (gdb) bt
>> #0  0x7f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>> #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
>> #2  0x7f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
>> #3  0x7f8272cf297a in __new_sem_wait (sem=) at sem_wait.c:29
>> #4  0x55f60d36c252 in PGSemaphoreLock ()
>> #5  0x55f60d3e35cc in LWLockAcquire ()
>> #6  0x55f60d3d14ee in ProcArrayAdd ()
>> #7  0x55f60d3e01a3 in InitProcessPhase2 ()
>> #8  0x55f60d51c99b in InitPostgres ()
>> #9  0x55f60d3f2baf in PostgresMain ()
>> #10 0x55f60d102bf8 in ?? ()
>> #11 0x55f60d37e492 in PostmasterMain ()
>> #12 0x55f60d103fa5 in main ()
>
> which implies that the problem is unexpectedly high contention for the
> ProcArrayLock.  It's unsurprising that incoming processes would need
> that lock, since they need to add themselves to the ProcArray.  What's
> not so apparent is why there is so much activity as to block them
> from doing so for a significant amount of time.  We can guess that
> the "worker processes" you mentioned are ganging up on that lock
> because they need to compute query snapshots, but that still doesn't
> seem like enough of an explanation.
>
> Can you show us exactly what the "worker processes" are doing while
> this is going on?

One thing I should have mentioned, but forgot, is that the database is
configured to do logical replication to another machine - could that
have an effect on the lock contention?

A colleague pointed out that in the pg_locks output, the replication
processes are waiting on the ProcArray lock:

 · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

I can try turning off replication the next time I see the problem.

The workers basically loop on:

 * SELECT a job from a job-table, UPDATE it
 * Take an advisory lock on the relevant row in the calculations table
 * Do some work, at least a couple of SELECTs from various tables
 * UPDATE or INSERT a row into the calculations table, release the lock
 * INSERT a row in an exitstatus table
 * DELETE the job

The calculations can be sub-second to several minutes, the majority
mostly in the shorter end. As this is happening there are millions of
jobs in the job-table.

When the problem occurs, the worker logfiles indicate that they are
moving along, still working.

To give a sense of scale, here are the number of transactions per
second plotted:

 · https://koldfront.dk/misc/postgresql/postgres_transactions_ALL-pinpoint_1597799383-1597907383.png

(I think the hole in the graph is when Munin timed out getting the
information, and the number of transactions dropped as I stopped most
of the workers.)


  Best regards,

    Adam

--
 "Examination and mastering of a new highly               Adam Sjøgren
  intellectual equipment was a hard labour."           a...@koldfront.dk
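As an illustration of the loop described above, roughly what one
iteration could look like in SQL - a sketch only, with hypothetical
table and column names (jobqueue, calculations, exitstatus), not the
actual application code, which runs the calculation itself on the
client side:

  DO $$
  DECLARE
    v_job  bigint;
    v_calc bigint;
  BEGIN
    -- claim one queued job (hypothetical jobqueue table)
    SELECT jobid, calc_id INTO v_job, v_calc
    FROM jobqueue
    WHERE state = 'queued'
    ORDER BY jobid
    LIMIT 1
    FOR UPDATE;

    IF v_job IS NULL THEN
      RETURN;  -- nothing to do
    END IF;

    UPDATE jobqueue SET state = 'running' WHERE jobid = v_job;

    -- serialize work on the calculation row across workers
    PERFORM pg_advisory_lock(v_calc);

    -- ... the SELECTs and the calculation itself would happen here ...

    INSERT INTO calculations (calc_id, result)
    VALUES (v_calc, 'result')
    ON CONFLICT (calc_id) DO UPDATE SET result = EXCLUDED.result;

    PERFORM pg_advisory_unlock(v_calc);

    INSERT INTO exitstatus (jobid, status) VALUES (v_job, 0);
    DELETE FROM jobqueue WHERE jobid = v_job;
  END;
  $$;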
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Tom writes:

> Adam Sjøgren writes:
>> Tom writes:
>>> ... which implies that the problem is unexpectedly high contention for the
>>> ProcArrayLock.
>
>> One thing I should have mentioned, but forgot, is that the database is
>> configured to do logical replication to another machine - could that
>> have an effect on the lock contention?
>> A colleague pointed out that in the pg_locks output, the replication
>> processes are waiting on the ProcArray lock:
>>  · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz
>
> Yeah, that is *mighty* interesting.  For the archives' sake, I'll point
> out that this shows 16 walsender processes, of which 6 are in
> WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10
> are waiting for the ProcArrayLock.

Sorry for not returning on this before now.

I have seen the slow login problem - ProcArrayLock contention - happen
with replication turned off, so I think that can be ruled out as the
cause.

I just observed something interesting right now, however.

Just to recap the scene: the basic setup is that we have a table with a
queue of jobs to be processed, and we have a number of programs (say
10) on 6 other machines grabbing those jobs, doing calculations and
reporting back. The number of jobs at any given time can fluctuate from
none to millions. Typically millions of jobs take some days to a week
to get through.

Now, I happened to have a psql prompt open when the problem of new
logins being slow appeared - and I ran a query that counted the number
of jobs by job-type. Before the slow login problem, that query took
around 70-80 ms, but when the slow login appeared, it took 20+ seconds
to run the query. EXPLAIN ANALYZE showed that it was running a parallel
query.

So I tried turning parallel queries off - with SET
max_parallel_workers_per_gather = 0; - and now the query took around
150 ms consistently, despite logins still being (variably) slow.

So that confirms the ProcArrayLock contention, I think.

My next thought is: I have 6*10 programs making connections and doing
parallel queries over the job queue, to get new jobs. If the jobs
happen to be very "easy", those programs will be doing a lot of queries
for jobs. And when the queue is large enough (I'm guessing), those
queries will be parallel, so they will be creating new processes that
need to be added and removed from the ProcArray.

So am I shooting myself in the foot here, by creating lock contention
when trying to get jobs quickly from a largeish table?

I think my next step will be to modify the programs that grab the jobs
to avoid parallel queries when looking up the next job to run.

You anticipated this, it just took me until now to get a little closer:

> It's also evident that there's some parallel query activity going on,
> as some sessions are in BgWorkerShutdown waits; so parallel worker
> entry/exit might be contributing to the overall load on ProcArrayLock.


  Best regards,

    Adam

--
 "That's one of the remarkable things about life.         Adam Sjøgren
  It's never so bad that it can't get worse."          a...@koldfront.dk
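If modifying the worker programs themselves is inconvenient, the same
effect can be had declaratively - a sketch, assuming the workers
connect as a dedicated role (the role name here is a guess based on the
ps output earlier in the thread):

  -- per session, from the worker code itself:
  SET max_parallel_workers_per_gather = 0;

  -- or once, for the role the workers log in as; applies to new sessions:
  ALTER ROLE workqueue SET max_parallel_workers_per_gather = 0;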
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Adrian writes:

> I don't have an answer. Not even sure if this is relevant to the
> problem, but how are the jobs getting into the queue?

Plain INSERTs - often a lot at the same time. I inserted 400K jobs to
clean up after a bug earlier today, for instance.

What were you suspecting?


  Best regards,

    Adam

--
 "Don't think twice                                       Adam Sjøgren
  It's all right"                                      a...@koldfront.dk
Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
We are seeing these "ERROR: unexpected chunk number 0 (expected 1) for
toast value 1498303849 in pg_toast_10919630" in increasing numbers
again¹.

An observation is that they seem to only happen for tsvector fields.

Here is an example sequence of queries for a record (we have more than
a handful of these currently), which exhibits the problem.

First we get two other fields, 'sequence' is large enough to be
toast'ed:

  2018-01-16 08:51:17.362 efam=# select id,sequence from efam.sequence where id = 164504550;
  Time: 1.150 ms

No problem.

Then we also fetch the tsvector field:

  2018-01-16 08:51:27.773 efam=# select id,sequence,fts from efam.sequence where id = 164504550;
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 in pg_toast_10919630
  Time: 0.912 ms

And we get the error.

Getting the id and the tsvector:

  2018-01-16 08:51:34.174 efam=# select id,fts from efam.sequence where id = 164504550;
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 in pg_toast_10919630
  Time: 6.138 ms

gives the error.

Just getting the tsvector:

  2018-01-16 08:51:40.066 efam=# select fts from efam.sequence where id = 164504550;
  ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 in pg_toast_10919630
  Time: 1.805 ms

Gives the error.

Field definitions:

   id       | integer  |
   sequence | text     |
   fts      | tsvector |

Maybe the observation that this only happens (for us) on tsvector rings
a bell for someone?


  Best regards,

    Adam


¹ As reported back in June, 2017, starting here:
  https://www.postgresql.org/message-id/7pefuv53dl.fsf%40novozymes.com
  (I have to admit I never got around to trying to revert the commit
  Alvaro Herrera suggested we try without
  (https://www.postgresql.org/message-id/20170611033840.hruqadsk47qcdrqb%40alvherre.pgsql))

--
 "No more than that, but very powerful all the           Adam Sjøgren
  same; simple things are good."                   a...@novozymes.com
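To double-check which table a given toast table belongs to (the OIDs in
the error messages are not always obvious), the owning relation can be
looked up like this - a sketch:

  select c.oid::regclass as owning_table
  from pg_class c
  where c.reltoastrelid = 'pg_toast.pg_toast_10919630'::regclass;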
Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Hi Jorge,


This sounds very interesting - we are running PostgreSQL 9.3.20.

Did you ever find out exactly what the change that solved the problem
between 9.4.8 and 9.4.11 was?


  Best regards,

    Adam


Jorge writes:

> Hi Adam, I've been seeing this same kind of Error in my clusters for a
> while.
>
> Is this running on postgresql 9.4.8? Because mine was fixed upgrading to
> 9.4.11, a bug indeed.
>
> Kind regards
>
> Jorge Daniel Fernandez
>
>
> From: Adam Sjøgren
> Sent: Tuesday, January 16, 2018 7:18 AM
> To: pgsql-gene...@postgresql.org
> Subject: Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
>
> We are seeing these "ERROR: unexpected chunk number 0 (expected 1) for
> toast value 1498303849 in pg_toast_10919630" in increasing numbers again¹.
>
> An observation is that they seem to only happen for tsvector fields.
>
> Here is an example sequence of queries for a record (we have more than a
> handful of these currently), which exhibits the problem.
>
> First we get two other fields, 'sequence' is large enough to be toast'ed:
>
>   2018-01-16 08:51:17.362 efam=# select id,sequence from efam.sequence where id = 164504550;
>   Time: 1.150 ms
>
> No problem.
>
> Then we also fetch the tsvector field:
>
>   2018-01-16 08:51:27.773 efam=# select id,sequence,fts from efam.sequence where id = 164504550;
>   ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 in pg_toast_10919630
>   Time: 0.912 ms
>
> And we get the error.
>
> Getting the id and the tsvector:
>
>   2018-01-16 08:51:34.174 efam=# select id,fts from efam.sequence where id = 164504550;
>   ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 in pg_toast_10919630
>   Time: 6.138 ms
>
> gives the error.
>
> Just getting the tsvector:
>
>   2018-01-16 08:51:40.066 efam=# select fts from efam.sequence where id = 164504550;
>   ERROR:  unexpected chunk number 0 (expected 1) for toast value 1498303849 in pg_toast_10919630
>   Time: 1.805 ms
>
> Gives the error.
>
> Field definitions:
>
>    id       | integer  |
>    sequence | text     |
>    fts      | tsvector |
>
> Maybe the observation that this only happens (for us) on tsvector rings
> a bell for someone?
>
>
>   Best regards,
>
>     Adam
>
>
> ¹ As reported back in June, 2017, starting here:
>   https://www.postgresql.org/message-id/7pefuv53dl.fsf%40novozymes.com
>   (I have to admit I never got around to trying to revert the commit
>   Alvaro Herrera suggested we try without
>   (https://www.postgresql.org/message-id/20170611033840.hruqadsk47qcdrqb%40alvherre.pgsql))
>
> --
>  "No more than that, but very powerful all the           Adam Sjøgren
>   same; simple things are good."                   a...@novozymes.com

--
 "No more than that, but very powerful all the           Adam Sjøgren
  same; simple things are good."                   a...@novozymes.com
Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Michael writes:

> On Tue, Jan 16, 2018 at 07:05:19PM +0100, Adam Sjøgren wrote:
>> This sounds very interesting - we are running PostgreSQL 9.3.20.
>
> Which means that we may be looking at a new bug, 9.3.20 is the latest in
> the 9.3 set as of today.

Yes; unfortunately we have failed to reproduce it outside production.

The fact that the tsvector field is always involved when we see the
error might be of interest (the tsvector field is the most updated in
our database, however).

Just for completeness, the tsvector field has a GIN index on it:

  "sequence_fts_idx" gin (fts) WITH (fastupdate=off)

and it is updated by a BEFORE INSERT OR UPDATE trigger.

A new observation is that where we previously were able to get the
"unexpected chunk number" to go away by simply updating the tsvector
field of the offending record, we now have a record where we get
"ERROR: tuple concurrently updated" when we try overwriting the field.

On another record exhibiting the "unexpected chunk number" error, we
could overwrite the fts field, as we can on rows not affected by the
"unexpected chunk number" error. So it seems the two errors might
somehow be related.

We tried stopping all activity on the database, and still got the
"ERROR: tuple concurrently updated" on the row with "unexpected chunk
number".

Also, the error we are getting is now: "unexpected chunk number 2
(expected 3) for toast value 1498303849 in pg_toast_10919630", where
previously we've only seen "unexpected chunk number 0 (expected 1)".

We are kind of at a loss, so any suggestions on what we could try are
welcome.

>> Did you ever find out exactly what the change that solved the problem
>> between 9.4.8 and 9.4.11 was?
>
> In this case, I think that you are looking for this thread:
> https://www.postgresql.org/message-id/20160826072658.15676.7...@wrigleys.postgresql.org
>
> And this commit:
>
>   commit: a694435641faf26a9a4c210d20576ae836e86c48
>   author: Tom Lane
>   date: Sat, 3 Sep 2016 13:28:53 -0400
>   Fix corrupt GIN_SEGMENT_ADDITEMS WAL records on big-endian hardware.
>
> Both involved 9.4.8.

We run on x86_64 hardware, so I guess this wouldn't affect us?


  Best regards,

    Adam

--
 "No more than that, but very powerful all the           Adam Sjøgren
  same; simple things are good."                   a...@novozymes.com
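For the records where overwriting the tsvector field does still work,
the overwrite can be done without reading the damaged value at all,
since assigning NULL does not detoast the old contents - a sketch,
using the id from the examples above; the BEFORE INSERT OR UPDATE
trigger mentioned above may then repopulate fts on a later update. This
only papers over the symptom for that row; it does not explain or
repair the underlying corruption:

  UPDATE efam.sequence
  SET fts = NULL
  WHERE id = 164504550;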