Re: "PANIC: could not open critical system index 2662" - twice

2023-05-07 Thread Thomas Munro
On Sun, May 7, 2023 at 1:21 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Did you previously run this same workload on versions < 15 and never
> > see any problem?  15 gained a new feature CREATE DATABASE ...
> > STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
> > bug somewhere near that, though I have no specific idea.
>
> Per the release notes I was just writing ...
>
>   Fix potential corruption of the template (source) database after
>   CREATE DATABASE with the STRATEGY
>   WAL_LOG option (Nathan Bossart, Ryo Matsumura)
>  

Hmm.  That bug seems to have caused corruption (backwards time travel)
of blocks in the *source* DB's pg_class, by failing to write back
changes.  We seem to have zeroed pages in the *target* database, for
all catalogs (apparently everything copied by
RelationCopyStorageUsingBuffer()), even though the template is still
fine.  It is as if RelationCopyStorageUsingBuffer() created the
zero-filled file with smgrextend(), but then the buffer data was never
written out, even though we memcpy'd it into a buffer and set the
buffer dirty.
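
For what it's worth, one quick way to confirm that symptom on a broken
cluster would be something like this (just a sketch; the database OID 16390
and the catalog relfilenodes are examples, and it assumes GNU stat/cmp):

cd $PGDATA/base/16390
for f in 1259 2662 2663; do    # pg_class, its OID index, its name index
  size=$(stat -c%s "$f")
  if cmp -s -n "$size" "$f" /dev/zero; then
    echo "$f: all zeroes ($size bytes)"
  else
    echo "$f: contains non-zero data"
  fi
done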

Bug-in-PostgreSQL explanations could include that we forgot it was
dirty, or some backend wrote it out to the wrong file; but if we were
forgetting something like permanent or dirty, would there be a more
systematic failure?  Oh, it could require special rare timing if it is
similar to 8a8661828's confusion about permanence level or otherwise
somehow not setting BM_PERMANENT, but in the target blocks, so I think
that'd require a checkpoint AND a crash.  It doesn't reproduce for me,
but perhaps more unlucky ingredients are needed.

Bug-in-OS/FS explanations could include that a whole lot of writes
were mysteriously lost in some time window, so all those files still
contain the zeroes we write first in smgrextend().  I guess this
previously rare (previously limited to hash indexes?) use of sparse
file hole-punching could be a factor in an it's-all-ZFS's-fault
explanation:

openat(AT_FDCWD,"base/16390/2662",O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC,0600)
= 36 (0x24)
openat(AT_FDCWD,"base/1/2662",O_RDWR|O_CLOEXEC,00) = 37 (0x25)
lseek(37,0x0,SEEK_END)   = 32768 (0x8000)
lseek(37,0x0,SEEK_END)   = 32768 (0x8000)
pwrite(36,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,8192,0x6000) = 8192
(0x2000) <-- smgrextend(final block)
lseek(36,0x0,SEEK_END)   = 32768 (0x8000)

I was trying to think about how I might go about reproducing the exact
system setup.  Evgeny, do you mind sharing your "zfs get all
/path/to/pgdata" (curious to see block size, compression settings, and
anything else of note) and your postgresql.conf?  And your exact Ubuntu
kernel version and ZFS package versions?
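
For reference, roughly the commands that would gather all of that (the
dataset name here is only an example):

zfs get all tank/pgdata        # record size, compression and the rest
uname -r                       # exact kernel version
dpkg -l | grep -E 'zfs|spl'    # installed ZFS/SPL package versions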




Re: Death postgres

2023-05-07 Thread Marc Millas
On Sat, May 6, 2023 at 21:46, Adrian Klaver  wrote:

> On 5/6/23 10:13, Marc Millas wrote:
> >
> >
>
> > When you restarted the server were there any warnings shown?
> >
> > Sadly, I cannot. Will be done next tuesday.
>
> Cannot do what:
>
> 1) Get to the log to see if there are warnings?
>
> 2) Restart the server?
>
> Your original post said the server crashed.
>
> If that was the case how can you do any of the below without restarting it?
> Infrastructure is managed by another company. The db is within a
> container.
>
So, if I run a pg_ctl restart, the container... vanishes.
So for each major action, I must ask... by mail.

> >
> >
> > Using psql can you \d ?
> >
> > Yes, and no pb to check pg_statistic,...
> >
> >
> > Can you select from any other table in the database?
> >
> > Yes
> >


Re: "PANIC: could not open critical system index 2662" - twice

2023-05-07 Thread Evgeny Morozov
On 6/05/2023 11:13 pm, Thomas Munro wrote:
> Did you previously run this same workload on versions < 15 and never
> see any problem?
Yes, kind of. We have a test suite that creates one test DB and runs a
bunch of tests on it. Two of these tests, however, create another DB
each (also by cloning the same template DB) in order to test copying
data between DBs. It's only these "extra" DBs that were corrupted, at
least on this occasion. (Hard to say about the last time, because that
time it all went south and the whole server crashed, and we may have had
some residual corruption from bad disks then - who knows.) I'm not sure
whether the tests that created the extra DBs existed before we upgraded
to PG 15, but we definitely have not seen such problems on PG 13 or 14.

> It seems like you have some kind of high frequency testing workload that 
> creates and tests databases all day long, and just occasionally detects this 
> corruption.
Maybe 10-30 times per day normally, depending on the day. However, I
have tried to repro this by running those two specific tests thousands
of times in one day, without success.
> Would you like to try requesting FILE_COPY for a while and see if it 
> eventually happens like that too?
Sure, we can try that.

On 7/05/2023 12:30 pm, Thomas Munro wrote:
> your "zfs get all /path/to/pgdata"

PROPERTY  VALUE  SOURCE
type  filesystem -
creation  Mon Mar  6 17:07 2023  -
used  166G   -
available 2.34T  -
referenced    166G   -
compressratio 2.40x  -
mounted   yes    -
quota none   default
reservation   none   default
recordsize    16K    local
mountpoint    /default
sharenfs  off    default
checksum  on default
compression   lz4    received
atime off    inherited from pgdata
devices   on default
exec  off    inherited from pgdata
setuid    off    inherited from pgdata
readonly  off    default
zoned off    default
snapdir   hidden default
aclinherit    restricted default
createtxg 90 -
canmount  on received
xattr on default
copies    1  default
version   5  -
utf8only  off    -
normalization none   -
casesensitivity   sensitive  -
vscan off    default
nbmand    off    default
sharesmb  off    default
refquota  none   default
refreservation    none   default
primarycache  all    default
secondarycache    all    default
usedbysnapshots   199M   -
usedbydataset 166G   -
usedbychildren    0B -
usedbyrefreservation  0B -
logbias   latency    default
dedup off    default
mlslabel  none   default
sync  standard   default
dnodesize legacy default
refcompressratio  2.40x  -
written   64.9M  -
logicalused   397G   -
logicalreferenced 397G   -
volmode   default    default
filesystem_limit  none   default
snapshot_limit    none   default
filesystem_count  none   default
snapshot_count    none   default
snapdev   hidden default
acltype   off    default
context   none   default
fscontext none   default
defcontext    none   default
rootcontext   none   default
relatime  off    default
redundant_metadata    all    default
overlay   off    default

> your postgresql.conf?

We have a bunch of config files, so I tried to get the resulting config
using "select name, setting from pg_settings where source =
'configuration file'" - hopefully that gives what you wanted.

    name    |    setting

Re: Death postgres

2023-05-07 Thread Adrian Klaver

On 5/7/23 04:46, Marc Millas wrote:



On Sat, May 6, 2023 at 21:46, Adrian Klaver  wrote:


On 5/6/23 10:13, Marc Millas wrote:
 >
 >

 >     When you restarted the server were there any warnings shown?
 >
 > Sadly, I cannot. Will be done next tuesday.

Cannot do what:

1) Get to the log to see if there are warnings?

2) Restart the server?

Your original post said the server crashed.

If that was the case how can you do any of the below without
restarting it?
Infrastructure is managed by another company. The db is within a
container.

So, if I run a pg_ctl restart, the container... vanishes.
So for each major action, I must ask... by mail.


1) The question remains: if the container crashes when you run pg_ctl 
restart, how can you do the following?:



Using psql can you \d ?

Yes, and no pb to check pg_statistic,...


Can you select from any other table in the database?

Yes


2) You need to provide a more detailed explanation of the setup that the 
Postgres server is running in.


a) The host OS and version.

b) The container specifications.

   1) What container software and version?

   2) 'Hardware' specifications.

c) Some indication of the size of the data sets you are querying. Also 
the query you used.






--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Death postgres

2023-05-07 Thread Laurenz Albe
On Sat, 2023-05-06 at 03:14 +0200, Marc Millas wrote:
> postgres 14.2 on Linux redhat
> 
> temp_file_limit set around 210 GB.
> 
> a select request with 2 left joins has crashed the server (OOM killer) after
> the postgres disk occupation grew from 15 TB to 16 TB.
> 
> What are the cases where postgres may grow without caring about 
> temp_file_limit ?

That's too little information for a decent answer.
One obvious answer is: if it is not writing temporary files.
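
For example, temp_file_limit only caps the temporary files a process writes
for sorts, hashes and temporary query results; it does not limit ordinary
table, index or WAL growth.  A quick way to see what a query actually spills
(a sketch, assuming you can change settings for your session):

SHOW temp_file_limit;
SET log_temp_files = 0;   -- log every temporary file created from now on
-- re-run the query, then check the server log, or:
SELECT datname, temp_files, temp_bytes FROM pg_stat_database;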

Yours,
Laurenz Albe




Hung Query with No Locking Issues

2023-05-07 Thread Michael P. McDonnell
Hey team -
I have 2 stored procedures that need to run back to back. It could
convert to a single one - but it's easier from a maintenance perspective to
keep them separated.

The first procedure effectively is
INSERT INTO table_b () SELECT  FROM _table_a_;
COMMIT;

Total execution time - about 180s. Nothing in the pg_locks table and
nothing in the pg_stat_activity table suggests anything is hung over.

The second procedure mutates table_b data into table_b_collapsed
INSERT INTO table_c () SELECT _ FROM _table_b_ JOIN _table_b as b1_
JOIN _table_b as b2_ JOIN _table_b as b3_, etc...;
COMMIT;

The first time I run my second stored procedure - it hangs for up to 8
hours.
If I immediately cancel and re-run the second stored procedure it runs in 2
seconds.
If I let the query hang and rerun THE EXACT SAME QUERY in another tab of
dbeaver - it runs in 2 seconds.

During the hang - there are no ungranted locks in pg_locks, nothing unusual
in the stats_activity table that I know how to look for. CPU is locked at
100% for the postgres process during the duration of the hung query as well.
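
For reference, this is roughly what I have been checking from another session
while the query hangs (a sketch):

SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle';

SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted;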

Speaking of CPU: this is running on a VMware VM with 8 CPUs and 192 GB of
RAM, as a solo VM on a 256 GB / 16 CPU host, with SSDs backing the DB locally
on the server (SATA 3 speeds, I'm sure).  Storage is sitting comfortably
at 20% used and 1% inode utilization.

Not really sure what I'm missing here; I was having these problems even
when they weren't a stored procedure. I was hoping pushing the commit
immediately after the query would solve my problem. Any advice or pointers
would be appreciated.


Re: Hung Query with No Locking Issues

2023-05-07 Thread Tom Lane
"Michael P. McDonnell"  writes:
> I have 2 stored procedures that need to run back to back. It could
> convert to a single one - but it's easier from a maintenance perspective to
> keep them separated.

> The first procedure effectively is
> INSERT INTO table_b () SELECT  FROM _table_a_;
> COMMIT;

> Total execution time - about 180s. Nothing in the pg_locks table and
> nothing in the pg_stat_activity table suggests anything is hung over.

> The second procedure mutates table_b data into table_b_collapsed
> INSERT INTO table_c () SELECT _ FROM _table_b_ JOIN _table_b as b1_
> JOIN _table_b as b2_ JOIN _table_b as b3_, etc...;
> COMMIT;

> The first time I run my second stored procedure - it hangs for up to 8
> hours.
> If I immediately cancel and re-run the second stored procedure it runs in 2
> seconds.

Perhaps an "ANALYZE table_b" in between would help.

regards, tom lane




Re: Hung Query with No Locking Issues

2023-05-07 Thread Tom Lane
"Michael P. McDonnell"  writes:
> On Sun, May 7, 2023 at 4:25 PM Tom Lane  wrote:
>> Perhaps an "ANALYZE table_b" in between would help.

> Okay - that worked.
> How did you know that would work? That's incredible.

Well, I guessed ;-) ... but it was a well-educated guess.

The symptoms sounded like you were getting a bad plan choice for
the second query, which is unsurprising if you populate a table
from empty and don't give auto-analyze enough time to collect
some statistics behind your back.

As a solution going forward, I'd recommend adding the ANALYZE
as the last step in the table-populating procedure.
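
Something like this, as a sketch (the procedure and table names just follow
your example):

CREATE OR REPLACE PROCEDURE populate_table_b()
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO table_b SELECT * FROM table_a;
    COMMIT;
    ANALYZE table_b;  -- give the planner fresh statistics right away
END;
$$;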

regards, tom lane




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-07 Thread Thomas Munro
On Mon, May 8, 2023 at 4:10 AM Evgeny Morozov
 wrote:
> On 6/05/2023 11:13 pm, Thomas Munro wrote:
> > Would you like to try requesting FILE_COPY for a while and see if it 
> > eventually happens like that too?
> Sure, we can try that.

Maybe you could do some one way and some the other, so that we try to
learn more?
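
Something like this, keeping both variants in the test suite (database and
template names are only placeholders):

CREATE DATABASE testdb_filecopy TEMPLATE my_template STRATEGY = FILE_COPY;
CREATE DATABASE testdb_wallog   TEMPLATE my_template STRATEGY = WAL_LOG;  -- the v15 default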

> Ubuntu 18.04.6
> Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023
> x86_64 x86_64 x86_64 GNU/Linux
> zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64

I tried for a few hours to reproduce this by making as many things as
similar to yours as I could, based on the info in this thread (Linux:
up-to-date Ubuntu 18.04 in Vagrant, which has nearly the same kernel,
4.15.0-208-generic, and a close but slightly different version of ancient
ZFS, 0.7.5-1ubuntu15, not sure why; ZFS: a mirror of a pair of loopback
files, recordsize=16K, compression=lz4; PG: compiled from tag REL_15_2,
data_checksums=on, full_page_writes=off, wal_recycle=off,
wal_init_zero=off), with what I thought might be roughly what you're doing
(creating three DBs, two of them clones of the first, with various
modifications at various points and various overlapping activities, then
checking for catalog corruption).  No cigar.  Hrmph.
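
In case anyone else wants to try, a sketch of that kind of setup (file names,
sizes and pool/dataset names are arbitrary):

truncate -s 4G /var/tmp/zdisk0 /var/tmp/zdisk1
sudo zpool create testpool mirror /var/tmp/zdisk0 /var/tmp/zdisk1
sudo zfs create -o recordsize=16k -o compression=lz4 -o atime=off testpool/pgdata
sudo chown postgres:postgres /testpool/pgdata
# then initdb --data-checksums -D /testpool/pgdata, and set full_page_writes=off,
# wal_recycle=off, wal_init_zero=off in postgresql.conf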




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-07 Thread Michael Paquier
On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote:
> Bug-in-PostgreSQL explanations could include that we forgot it was
> dirty, or some backend wrote it out to the wrong file; but if we were
> forgetting something like permanent or dirty, would there be a more
> systematic failure?  Oh, it could require special rare timing if it is
> similar to 8a8661828's confusion about permanence level or otherwise
> somehow not setting BM_PERMANENT, but in the target blocks, so I think
> that'd require a checkpoint AND a crash.  It doesn't reproduce for me,
> but perhaps more unlucky ingredients are needed.
> 
> Bug-in-OS/FS explanations could include that a whole lot of writes
> were mysteriously lost in some time window, so all those files still
> contain the zeroes we write first in smgrextend().  I guess this
> previously rare (previously limited to hash indexes?) use of sparse
> file hole-punching could be a factor in an it's-all-ZFS's-fault
> explanation:

Yes, you would need a bit of all that.

I can reproduce the same backtrace here.  That's just my usual laptop
with ext4, so this would be a Postgres bug.  First, here are the four
things running in parallel so that I can get a failure in loading a
critical index when connecting:
1) Create and drop a database with WAL_LOG as strategy and the
regression database as template:
while true; do
  createdb --template=regression --strategy=wal_log testdb;
  dropdb testdb;
done
2) Feeding more data to pg_class in the middle, while testing the
connection to the database created:
while true;
  do psql -c 'create table popo as select 1 as a;' regression > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
  psql -c 'drop table popo' regression > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
done;
3) Force some checkpoints:
while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done
4) Force a few crashes and recoveries:
while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done
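
If it helps, here is a sketch of a driver that runs all four loops above
concurrently and watches the log for the failure (the logfile location and
database names are assumptions, and the regression database must already
exist):

( while true; do createdb --template=regression --strategy=wal_log testdb; dropdb testdb; done ) &
( while true; do
    psql -c 'create table popo as select 1 as a;' regression >/dev/null 2>&1
    psql testdb -c 'select 1' >/dev/null 2>&1
    psql -c 'drop table popo' regression >/dev/null 2>&1
    psql testdb -c 'select 1' >/dev/null 2>&1
  done ) &
( while true; do psql -c 'checkpoint' >/dev/null 2>&1; sleep 4; done ) &
( while true; do pg_ctl stop -m immediate; pg_ctl start -l logfile; sleep 4; done ) &
tail -f logfile | grep --line-buffered 'could not open critical system index'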

And here is one backtrace:
(gdb) bt
#0  __pthread_kill_implementation (threadid=, 
signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x7f8a395cad2f in __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78
#2  0x7f8a3957bef2 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26
#3  0x7f8a39566472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x55f0b36b95ac in errfinish (filename=0x55f0b38d4f68 "relcache.c", 
lineno=4335, funcname=0x55f0b38d6500 <__func__.12> "load_critical_index") at 
elog.c:604
#5  0x55f0b36a9c64 in load_critical_index (indexoid=2662, heapoid=1259) at 
relcache.c:4335
#6  0x55f0b36a9712 in RelationCacheInitializePhase3 () at relcache.c:4110
(gdb) up 5
#5  0x55f0b36a9c64 in load_critical_index (indexoid=2662, heapoid=1259) at 
relcache.c:4335
4335            elog(PANIC, "could not open critical system index %u",
                            indexoid);

You can also get failures with btree deduplication because of the CTAS
running above, by the way, but that's just the tip of the iceberg once
the state is corrupted:
#0  __pthread_kill_implementation (threadid=, 
signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x7fcae38abd2f in __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78
#2  0x7fcae385cef2 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26
#3  0x7fcae3847472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x5556a809f584 in ExceptionalCondition (conditionName=0x5556a812b639 
"false", fileName=0x5556a812abc8 "heapam.c", lineNumber=7882) at assert.c:66
#5  0x5556a79e13db in index_delete_sort_cmp (deltid1=0x5556a9682158, 
deltid2=0x7ffdb62c7088) at heapam.c:7882
#6  0x5556a79e14f6 in index_delete_sort (delstate=0x7ffdb62c8350) at 
heapam.c:7923
#7  0x5556a79e0cd0 in heap_index_delete_tuples (rel=0x7fcae0f1a9a8, 
delstate=0x7ffdb62c8350) at heapam.c:7580
#8  0x5556a7a07be4 in table_index_delete_tuples (rel=0x7fcae0f1a9a8, 
delstate=0x7ffdb62c8350) at ../../../../src/include/access/tableam.h:1337
#9  0x5556a7a0a7a3 in _bt_delitems_delete_check (rel=0x7fcae0f231c8, 
buf=179, heapRel=0x7fcae0f1a9a8, delstate=0x7ffdb62c8350) at nbtpage.c:1541
#10 0x5556a7a07545 in _bt_simpledel_pass (rel=0x7fcae0f231c8, buffer=179, 
heapRel=0x7fcae0f1a9a8, deletable=0x7ffdb62c8430, ndeletable=160, 
newitem=0x5556a9689260, 

Anyway, you should be able to see that with b3e184a (pretty much the same
as 15.2), without 8a86618 included.  Once 8a86618 is included, all these
steps are stable in the backend, at least here.  Or do we have some
low-hanging fruit with the WAL_LOG strategy?  That could always be
possible, of course, but that looks like the same issue to me, just
with a different symptom showing up.
--
Michael




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-07 Thread Thomas Munro
On Mon, May 8, 2023 at 2:24 PM Michael Paquier  wrote:
> I can reproduce the same backtrace here.  That's just my usual laptop
> with ext4, so this would be a Postgres bug.  First, here are the four
> things running in parallel so as I can get a failure in loading a
> critical index when connecting:

That sounds like good news, but I'm still confused: do you see all 0s
in the target database (popo)'s catalogs, as reported (and if so can
you explain how they got there?), or is it regression that is
corrupted in more subtle ways also involving 1s?




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-07 Thread Michael Paquier
On Mon, May 08, 2023 at 02:46:37PM +1200, Thomas Munro wrote:
> That sounds like good news, but I'm still confused: do you see all 0s
> in the target database (popo)'s catalogs, as reported (and if so can
> you explain how they got there?), or is it regression that is
> corrupted in more subtle ways also involving 1s?

Nope, I have not been able to confirm that yet without 8a86618.  The
test runs at a high frequency, so that's kind of hard to catch.  I
have not been able to get things in a state where I could look at a
FPW for pg_class or its index, either, in a way similar to Evgeny.
--
Michael

