Streaming replication fails after some time with 'incorrect resource manager data checksum'

2019-12-18 Thread Julian Backes
Hello all!

I already posted in the Slack #help channel but got no answer :-(

We have a read only / hot standby system and are facing the same problem as
described in
https://stackoverflow.com/questions/35752389/incorrect-resource-manager-data-checksum-in-record-at-2-xyz-terminating-walrec
(the post is already 3 years old).

That means that after some time (sometimes two days, sometimes half a day),
Postgres starts logging 'incorrect resource manager data checksum in record
at xyz' and shuts down the WAL receiver (which stops streaming replication).

Master and slave are running Ubuntu 18.04, Postgres 12.1 and an ext4 file
system (no ZFS or Btrfs, just LVM on the master); we only use ECC memory
(192 GB on the master and 256 GB on the slave) and NVMe SSDs in a software
RAID 1 on both servers.
When the error occurs, a restart of postgres on the slave "fixes" the
problem.
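
For what it's worth, a minimal way to check from SQL whether the standby is
still streaming (pg_stat_wal_receiver and the functions below exist since
PostgreSQL 10, so they should be available on 12.1):

-- on the standby: an empty result here means the WAL receiver has stopped
SELECT pid, status FROM pg_stat_wal_receiver;

-- last WAL position received vs. replayed, and whether we are still in recovery
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_is_in_recovery();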

Any ideas what we can do to prevent/investigate the problem?

Kind regards
Julian


Re: ERROR: too many dynamic shared memory segments

2020-01-29 Thread Julian Backes
Hi,

we only had the "too many dynamic shared memory segments" error but no
segmentation faults. The error started occurring after upgrading from
Postgres 10 to Postgres 12 (the server has 24 cores / 48 threads, i.e. many
parallel workers). The error itself was not that much of a problem, but
/dev/shm started filling up with orphaned files which probably (?) had not
been cleaned up by Postgres after the parallel workers died. As a
consequence, /dev/shm eventually filled up completely and everything
crashed.

Unfortunately, the only "solution" we have found so far was to increase
max_connections from 100 to 1000. Since then (about two months ago, I
think), the error has been gone.
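
In case it helps anyone else hitting this: the number of dynamic shared
memory slots grows with max_connections (the 64 + 2 * max_connections
formula is quoted further down in this thread), so a rough estimate can be
computed like this (just a sketch; the exact constants live in the server
source and may differ between versions):

-- rough estimate of available DSM segment slots, per the formula quoted below
SELECT current_setting('max_connections')::int           AS max_connections,
       64 + 2 * current_setting('max_connections')::int  AS approx_dsm_slots;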

Maybe this helps...

Julian

On Wed, 29 Jan 2020 at 10:37, Nicola Contu <nicola.co...@gmail.com> wrote:

> This is the error on postgres log of the segmentation fault :
>
> 2020-01-21 14:20:29 GMT [] [4]: [108-1] db=,user= LOG:  server process
> (PID 2042) was terminated by signal 11: Segmentation fault
> 2020-01-21 14:20:29 GMT [] [4]: [109-1] db=,user= DETAIL:  Failed
> process was running: select pid from pg_stat_activity where query ilike
> 'REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats'
> 2020-01-21 14:20:29 GMT [] [4]: [110-1] db=,user= LOG:  terminating
> any other active server processes
> 2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [4-1]
> db=cmdv3,user=admin WARNING:  terminating connection because of crash of
> another server process
> 2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [5-1]
> db=cmdv3,user=admin DETAIL:  The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2020-01-21 14:20:29 GMT [127.0.0.1(34256)] [2724]: [6-1]
> db=cmdv3,user=admin HINT:  In a moment you should be able to reconnect to
> the database and repeat your command.
> 2020-01-21 14:20:29 GMT [127.0.0.1(34026)] [2055]: [5-1]
> db=cmdv3,user=admin WARNING:  terminating connection because of crash of
> another server process
>
> At CentOS level :
> Jan 21 14:20:29 usnyh-cmd1.gt-t.net kernel: postmaster[2042]: segfault at
> 0 ip 0048bef4 sp 7ffdf4955bb0 error 4 in postgres[40+6c5000]
> Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905
> GMT [33727] WARNING C-0x21526e8: cmdv3/admin@10.151.2.154:39688 pooler
> error: server conn crashed?
> Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905
> GMT [33727] WARNING S-0x1f5ff58: cmdv3/admin@127.0.0.1:5432 got packet
> 'N' from server when not linked
> Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.905
> GMT [33727] WARNING S-0x25097f0: cmdv3/admin@127.0.0.1:5432 got packet
> 'N' from server when not linked
> Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.906
> GMT [33727] WARNING S-0x2508b60: cmdv3/admin@127.0.0.1:5432 got packet
> 'N' from server when not linked
> Jan 21 14:20:29 usnyh-cmd1.gt-t.net pgbouncer: 2020-01-21 14:20:29.981
> GMT [33727] ERROR S: login failed: FATAL: the database system is in
> recovery mode
>
>
> > If you're on Linux, you can probably see them with "ls /dev/shm".
>
> I see a lot of files there, and when I cat them they are empty. What can I do
> with them?
>
> Those are two different problems I guess, but they are related because
> right before the Segmentation Fault I see a lot of shared segment errors in
> the postgres log.
>
> Il giorno mer 29 gen 2020 alle ore 10:09 Thomas Munro <
> thomas.mu...@gmail.com> ha scritto:
>
>> On Wed, Jan 22, 2020 at 4:06 AM Nicola Contu 
>> wrote:
>> > after a few months, we started having this issue again.
>> > So we reverted the work_mem parameter to 600MB instead of 2GB.
>> > But the issue is still there. A query hit a segmentation fault, the
>> DB went into recovery mode and our app went read-only for a few minutes.
>>
>> Hi Nicola,
>> Hmm, a segmentation fault sounds like a different problem.  Can you
>> please share the exact error messages from PostgreSQL and OS logs?
>>
>> > I understand we can increase max_connections so we can have many more
>> segments.
>> >
>> > My question is: is there a way to see the number of segments we have
>> reached?
>>
>> If you're on Linux, you can probably see them with "ls /dev/shm".
>>
>> > Currently we have max_connections = 220, so by your formula (64 + 2 * 220) we
>> have about 500 shared memory segments.
>> > We would like to increase that number to 300 or 400, but it would be great
>> to understand if there is a way to make sure this will solve the issue, as it
>> requires a restart of the service.
>> >
>> > I know you were also talking about a redesign of this part of PostgreSQL.
>> Do you know if anything has changed in any of the newer versions after 11.5?
>>
>> It's possible that we should increase a couple of constants used in the
>> formula -- I'll look into that again.  But first I'd like to see if
>> we're even investigating the right problem here.
>>

Re: Sequence vs UUID

2023-02-06 Thread Julian Backes
I don't really understand what you mean by 'performance'. To me it is not
surprising that incrementing (I know it is not just incrementing) a 64-bit
integer is faster than generating 128 bits of data with a good amount of
randomness, even if the reported timings seem too slow. So in my opinion you
need to separate (see the sketch below)
1) generating data (which might happen on the client in the UUID case and
not in the db...)
2) inserting data
3) selecting data

in both sequential and parallel scenarios.
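
A minimal psql sketch for timing the three steps separately (assuming
PostgreSQL 13+, where gen_random_uuid() is built in; on older versions it
comes from the pgcrypto extension; the table and sequence names are just
placeholders):

-- 1) generation only, no table involved
CREATE SEQUENCE IF NOT EXISTS bench_seq;
EXPLAIN ANALYZE SELECT nextval('bench_seq') FROM generate_series(1, 100000);
EXPLAIN ANALYZE SELECT gen_random_uuid()    FROM generate_series(1, 100000);

-- 2) insertion, where index maintenance starts to matter
CREATE TABLE bench_bigint (id bigint PRIMARY KEY);
CREATE TABLE bench_uuid   (id uuid   PRIMARY KEY);
EXPLAIN ANALYZE INSERT INTO bench_bigint
    SELECT nextval('bench_seq') FROM generate_series(1, 100000);
EXPLAIN ANALYZE INSERT INTO bench_uuid
    SELECT gen_random_uuid() FROM generate_series(1, 100000);

-- 3) selection, e.g. point lookups against both tables
EXPLAIN ANALYZE SELECT * FROM bench_bigint
    WHERE id = (SELECT id FROM bench_bigint LIMIT 1);
EXPLAIN ANALYZE SELECT * FROM bench_uuid
    WHERE id = (SELECT id FROM bench_uuid LIMIT 1);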

On Mon, 6 Feb 2023 at 19:32, veem v wrote:

> So it may be that the machine on which the code gets executed behind the
> scenes on the site "https://dbfiddle.uk/" plays a key role in the
> speed; however, the comparative performance of UUID vs sequence should stay
> the same.
> So I think, after this test, we can safely conclude that if we compare
> the performance of UUID (both version 4 and version 7) vs sequence, the
> UUID performs a lot worse compared to the sequence. So unless there exists
> some strong reason/justification for UUID, we should default to using the
> sequence. Correct me if I'm wrong. And also I understand the cases
> of multi-master replication/sharding etc. may be a factor, but other than
> that I can't think of any scenario where sequences can't be used.
>
>
>
> On Fri, 3 Feb 2023 at 23:07, Dominique Devienne 
> wrote:
>
>> On Fri, Feb 3, 2023 at 5:48 PM veem v  wrote:
>>
>>> Actually I did the testing by connecting to "https://dbfiddle.uk/",
>>> postgres version 15.
>>>
>>> PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0
>>> 20210514 (Red Hat 8.5.0-10), 64-bit
>>>
>>> Am I doing something wrong? Please confirm.
>>>
>>>
>> No clue. But even SQLite can generate 1M uuids (16-byte random blobs are
>> equivalent) in 1/3 of the time, so 30x faster.
>> So your timings of generating 100K uuids and counting them seem way too
>> slow to me. --DD
>>
>> sqlite> select count(randomblob(16)) from generate_series(1,1000*1000);
>> QUERY PLAN
>> `--SCAN generate_series VIRTUAL TABLE INDEX 3:
>> addr  opcode p1p2p3p4 p5  comment
>>   -        -  --  -
>> 0 Init   0 1500   Start at 15
>> 1 Null   0 1 10   r[1..1]=NULL
>> 2 VOpen  0 0 0 vtab:274D3E0   0
>> 3 Integer1 4 00   r[4]=1
>> 4 Multiply   6 6 50   r[5]=r[6]*r[6]
>> 5 Integer3 2 00   r[2]=3
>> 6 Integer2 3 00   r[3]=2
>> 7 VFilter0 1120   iplan=r[2]
>> zplan=''
>> 8   Function   1 8 7 randomblob(1)  0
>> r[7]=func(r[8])
>> 9   AggStep0 7 1 count(1)   1   accum=r[1]
>> step(r[7])
>> 10VNext  0 8 00
>> 11AggFinal   1 1 0 count(1)   0   accum=r[1] N=1
>> 12Copy   1 9 00   r[9]=r[1]
>> 13ResultRow  9 1 00   output=r[9]
>> 14Halt   0 0 00
>> 15Transaction0 0 1 0  1
>> usesStmtJournal=0
>> 16Integer1000  6 00   r[6]=1000
>> 17Integer168 00   r[8]=16
>> 18Goto   0 1 00
>> ┌───┐
>> │ count(randomblob(16)) │
>> ├───┤
>> │ 100   │
>> └───┘
>> Run Time: real 0.278 user 0.25 sys 0.00
>>
>>
>>> On Fri, 3 Feb 2023 at 21:28, Dominique Devienne 
>>> wrote:
>>>
 Something's off regarding Guid generations IMHO...
 You generate 100K Guids in ~1s. While we generate (in C++, Windows
 Release, using Boost) 16M of them in +/- the same time:

>>>