Re: 12.3 replicas falling over during WAL redo

2020-09-04 Thread Ben Chobot
Alvaro Herrera wrote on 8/3/20 4:54 PM: On 2020-Aug-03, Ben Chobot wrote: Alvaro Herrera wrote on 8/3/20 2:34 PM: On 2020-Aug-03, Ben Chobot wrote: dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501 If I use skip instead of seek Argh, yes, I did correct that in my tes

Re: 12.3 replicas falling over during WAL redo

2020-08-04 Thread Kyotaro Horiguchi
At Tue, 4 Aug 2020 09:53:36 -0400, Alvaro Herrera wrote in > On 2020-Aug-03, Alvaro Herrera wrote: > > > > lsn  | checksum | flags | lower | upper | special | pagesize | > > > version | prune_xid > > > --+--+---+---+---+-+--+-+---

Re: 12.3 replicas falling over during WAL redo

2020-08-04 Thread Alvaro Herrera
On 2020-Aug-03, Alvaro Herrera wrote: > > lsn  | checksum | flags | lower | upper | special | pagesize | > > version | prune_xid > > --+--+---+---+---+-+--+-+--- > >  A0A/99BA11F8 | -215 | 0 |   180 |  7240 |    8176

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Alvaro Herrera
On 2020-Aug-03, Ben Chobot wrote: > Alvaro Herrera wrote on 8/3/20 2:34 PM: > > On 2020-Aug-03, Ben Chobot wrote: > > dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501 > > If I use skip instead of seek Argh, yes, I did correct that in my test and forgot to copy and past
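The skip/seek correction above can be illustrated with a small sketch. In dd, `skip=N` skips N input blocks before reading, while `seek=N` skips N output blocks before writing, so pulling block N out of a file needs `skip`. The helper name `extract_block` and the three-block demo file are made up for illustration; nothing here touches a real relation file, and the 8192-byte block size is taken from the `bs=8192` in the thread.

```sh
#!/bin/sh
# Sketch only: extract_block and the demo file are hypothetical.
BLCKSZ=8192   # block size used in the thread's dd command

extract_block() {
    # $1 = input file, $2 = block number, $3 = output file
    # skip= offsets into the INPUT; seek= would offset into the output.
    dd if="$1" of="$3" bs="$BLCKSZ" count=1 skip="$2" 2>/dev/null
}

demo=$(mktemp)
out=$(mktemp)

# Build a three-block file in which block N is filled with the digit N.
for n in 0 1 2; do
    head -c "$BLCKSZ" /dev/zero | tr '\0' "$n"
done > "$demo"

extract_block "$demo" 2 "$out"
head -c 4 "$out"; echo    # prints 2222
```

With `seek=2` instead, dd would copy block 0 of the input and place it 16384 bytes into the output file, which is not the page being asked for.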

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Ben Chobot
Alvaro Herrera wrote on 8/3/20 2:34 PM: On 2020-Aug-03, Ben Chobot wrote: Alvaro Herrera wrote on 8/3/20 12:34 PM: On 2020-Aug-03, Ben Chobot wrote: Yep. Looking at the ones in block 6501, rmgr: Btree   len (rec/tot): 72/    72, tx:   76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378,

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Ben Chobot
Peter Geoghegan wrote on 8/3/20 3:04 PM: On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera wrote: You can use pageinspect's page_header() function to obtain the page's LSN. You can use dd to obtain the page from the file, dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501 Ben

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Peter Geoghegan
On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera wrote: > You can use pageinspect's page_header() function to obtain the page's > LSN. You can use dd to obtain the page from the file, > > dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501 Ben might find this approach to dumping
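As a rough cross-check on what page_header() reports, the LSN can also be read directly from a dumped page, since the page header starts with the 8-byte pd_lsn field (two 32-bit words, xlogid then xrecoff). The sketch below assumes the page was written by a little-endian server such as x86-64. The function name `page_lsn` and the fabricated 8-byte "page" are illustrative only; the LSN value is the one quoted elsewhere in this thread.

```sh
#!/bin/sh
# Sketch only: page_lsn is a hypothetical helper, and it assumes a
# little-endian page image.
page_lsn() {
    # Read the first 8 bytes of the file as individual hex bytes.
    set -- $(od -An -tx1 -N8 "$1")
    # Reassemble two little-endian 32-bit words and print them in the
    # usual %X/%X LSN notation.
    printf '%X/%X\n' "0x$4$3$2$1" "0x$8$7$6$5"
}

# Fabricated header bytes encoding xlogid 0x00000A0A, xrecoff 0x99BA11F8.
page=$(mktemp)
printf '\012\012\000\000\370\021\272\231' > "$page"

page_lsn "$page"    # prints A0A/99BA11F8
```

On a real dump the argument would be the file produced by dd, for example /tmp/page.6501.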

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Alvaro Herrera
On 2020-Aug-03, Ben Chobot wrote: > Alvaro Herrera wrote on 8/3/20 12:34 PM: > > On 2020-Aug-03, Ben Chobot wrote: > > > > Yep. Looking at the ones in block 6501, > > > > > rmgr: Btree   len (rec/tot): 72/    72, tx:   76393394, lsn: > > > A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LE

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Ben Chobot
Alvaro Herrera wrote on 8/1/20 9:35 AM: On 2020-Aug-01, Ben Chobot wrote: Can you find out what the index is being modified by those LSNs -- is it always the same index?  Can you have a look at nearby WAL records that touch the same page of the same index in each case? They turn out to be di

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Ben Chobot
Peter Geoghegan wrote on 8/3/20 11:25 AM: On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi wrote: All of the cited log lines seem suggesting relation with deleted btree page items. As a possibility I can guess, that can happen if the pages were flushed out during a vacuum after the last checkpo

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Ben Chobot
Alvaro Herrera wrote on 8/3/20 12:34 PM: On 2020-Aug-03, Ben Chobot wrote: Yep. Looking at the ones in block 6501, rmgr: Btree   len (rec/tot): 72/    72, tx:   76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel 16605/16613/60529051 blk 6501 rmgr:

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Alvaro Herrera
On 2020-Aug-03, Ben Chobot wrote: > rmgr: Btree   len (rec/tot): 72/    72, tx:   76396065, lsn: > A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel > 16605/16613/60529051 blk 6501 > > So then I did: > > /usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql
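One way to narrow pg_waldump output to the records that touch a single block is to filter on the `blkref` text, sketched below. This assumes the output is piped through grep; the pg_waldump invocation itself is not reproduced because the command in the message is truncated. The helper name `filter_block` is hypothetical. The first two sample lines are the records quoted in the thread, and the third is a made-up record for a different block, included to show that `blk 65010` is not matched.

```sh
#!/bin/sh
# Sketch only: filter_block is a hypothetical helper that reads
# pg_waldump-style text on stdin.
filter_block() {
    # $1 = relation as tablespace/database/relfilenode, $2 = block number
    # The trailing group keeps "blk 6501" from matching "blk 65010".
    grep -E "rel $1 blk $2([^0-9]|\$)"
}

# Lines 1-2 are from the thread; line 3 is a made-up non-matching record.
sample='rmgr: Btree len (rec/tot): 72/ 72, tx: 76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree len (rec/tot): 72/ 72, tx: 76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree len (rec/tot): 72/ 72, tx: 1, lsn: 0/0, prev 0/0, desc: INSERT_LEAF off 1, blkref #0: rel 16605/16613/60529051 blk 65010'

printf '%s\n' "$sample" | filter_block 16605/16613/60529051 6501
```

In practice the function would sit at the end of a pipeline fed by pg_waldump rather than by a sample string.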

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Ben Chobot
Ben Chobot wrote on 8/1/20 9:58 AM: Alvaro Herrera wrote on 8/1/20 9:35 AM: On 2020-Aug-01, Ben Chobot wrote: Can you find out what the index is being modified by those LSNs -- is it always the same index?  Can you have a look at nearby WAL records that touch the same page of the same index in

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Peter Geoghegan
On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi wrote: > All of the cited log lines seem suggesting relation with deleted btree > page items. As a possibility I can guess, that can happen if the pages > were flushed out during a vacuum after the last checkpoint and > full-page-writes didn't resto

Re: 12.3 replicas falling over during WAL redo

2020-08-03 Thread Ben Chobot
Kyotaro Horiguchi wrote on 8/2/20 9:39 PM: At Sat, 1 Aug 2020 09:58:05 -0700, Ben Chobot wrote in All of the cited log lines seem suggesting relation with deleted btree page items. As a possibility I can guess, that can happen if the pages were flushed out during a vacuum after the last checkpoi

Re: 12.3 replicas falling over during WAL redo

2020-08-02 Thread Kyotaro Horiguchi
At Sat, 1 Aug 2020 09:58:05 -0700, Ben Chobot wrote in > > > Alvaro Herrera wrote on 8/1/20 9:35 AM: > > On 2020-Aug-01, Ben Chobot wrote: > > > >> We have a few hundred postgres servers in AWS EC2, all of which do > >> streaming > >> replication to at least two replicas. As we've transitioned

Re: 12.3 replicas falling over during WAL redo

2020-08-01 Thread Ben Chobot
Alvaro Herrera wrote on 8/1/20 9:35 AM: On 2020-Aug-01, Ben Chobot wrote: We have a few hundred postgres servers in AWS EC2, all of which do streaming replication to at least two replicas. As we've transitioned our fleet from 9.5 to 12.3, we've noticed an alarming increase in the frequenc

Re: 12.3 replicas falling over during WAL redo

2020-08-01 Thread Alvaro Herrera
On 2020-Aug-01, Ben Chobot wrote: > We have a few hundred postgres servers in AWS EC2, all of which do streaming > replication to at least two replicas. As we've transitioned our fleet > from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a > streaming replica dying during