Re: Postgres PAF setup

2018-04-24 Thread Adrien Nayrat
On 04/23/2018 08:09 PM, Andrew Edenburn wrote:
> I am having issues with my PAF setup.  I am new to Postgres and have set up the
> cluster as seen below. 
> 
> I am getting this error when trying to start my cluster resources.
> 
>  
> 
> Master/Slave Set: pgsql-ha [pgsqld]
> 
>  pgsqld (ocf::heartbeat:pgsqlms):   FAILED dcmilphlum224 
> (unmanaged)
> 
>  pgsqld (ocf::heartbeat:pgsqlms):   FAILED dcmilphlum223 
> (unmanaged)
> 
> pgsql-master-ip    (ocf::heartbeat:IPaddr2):   Started dcmilphlum223
> 
>  
> 
> Failed Actions:
> 
> * pgsqld_stop_0 on dcmilphlum224 'unknown error' (1): call=239, 
> status=complete,
> exitreason='Unexpected state for instance "pgsqld" (returned 1)',
> 
>     last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=95ms
> 
> * pgsqld_stop_0 on dcmilphlum223 'unknown error' (1): call=248, 
> status=complete,
> exitreason='Unexpected state for instance "pgsqld" (returned 1)',
> 
>     last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=89ms
> 
>  
> 
> Cleanup and clear are not fixing any issues, and I am not seeing anything in the
> logs.  Any help would be greatly appreciated.
> 
>  

Hello Andrew,

Could you enable debug logs in Pacemaker?

On CentOS, you have to edit the PCMK_debug variable in /etc/sysconfig/pacemaker:

PCMK_debug=crmd,pengine,lrmd

This should give you more information in the logs. The monitor action in PAF
should report why the cluster doesn't start:
https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L1525

Regards,

-- 
Adrien NAYRAT





Re: [ClusterLabs] Reply: Postgres PAF setup

2018-04-25 Thread Adrien Nayrat
On 04/25/2018 02:31 AM, 范国腾 wrote:
> I have met a similar issue when Postgres was not stopped normally. 
>  
> You could run pg_controldata to check whether your Postgres cluster state is 
> "shut down" or "shut down in recovery".
> 
> I changed /usr/lib/ocf/resource.d/heartbeat/pgsqlms to avoid this problem:
> 
> elsif ( $pgisready_rc == 2 ) {
>     # The instance is not listening.
>     # We check the process status using pg_ctl status and check
>     # if it was properly shut down using pg_controldata.
>     ocf_log( 'debug', 'pgsql_monitor: instance "%s" is not listening',
>         $OCF_RESOURCE_INSTANCE );
>     # return _confirm_stopped();   # remove this line
>     return $OCF_NOT_RUNNING;
> }

Hello,

This is a bad idea. The goal of _confirm_stopped is to check whether the instance
was properly stopped. If it wasn't, you could corrupt your instance.

_confirm_stopped returns $OCF_NOT_RUNNING only if the instance was properly
shut down:

elsif ( $controldata_rc == $OCF_NOT_RUNNING ) {
    # The controldata state is consistent, the instance was probably
    # properly shut down.
    ocf_log( 'debug',
        '_confirm_stopped: instance "%s" controldata indicates that the instance was properly shut down',
        $OCF_RESOURCE_INSTANCE );
    return $OCF_NOT_RUNNING;
}

Regards,


-- 
Adrien NAYRAT






Re: Suggestion about logging only every n-th statement

2018-06-20 Thread Adrien NAYRAT

On 06/20/2018 01:51 PM, Janning Vygen wrote:
Back in 2009 I made a suggestion which is not implemented yet but would 
still be very valuable and, in my opinion, easy to implement (though not by me, 
as I don't speak C):


https://www.postgresql.org/message-id/flat/200903161426.56662.vygen%40kicktipp.de#200903161426.56662.vy...@kicktipp.de 



We still can't afford to log each and every statement, as we have too 
many and we don't want to spend more disk performance/space on it. Therefore 
we log only statements slower than 50 ms.


But for analyzing usage patterns it would be very nice to have this 
combined with a sample_rate for logging.


   logging_sample_rate = n

So each n-th statement will get logged regardless of execution time.

What do you think?

regards
Janning




Hello Janning,

FYI, I made this patch which seems to do what you want:
https://www.postgresql.org/message-id/flat/c30ee535-ee1e-db9f-fa97-146b9f62caed%40anayrat.info#c30ee535-ee1e-db9f-fa97-146b9f62c...@anayrat.info


I will add an entry in September's commitfest.
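
To illustrate the intended use, here is a minimal sketch; the GUC name
log_statement_sample_rate is assumed here for illustration and may change
during review:

-- Assumed GUC name for illustration: log_statement_sample_rate.
-- Sample 10% of the statements that exceed log_min_duration_statement;
-- with log_min_duration_statement = 0, this samples all statements.
ALTER SYSTEM SET log_min_duration_statement = 0;
ALTER SYSTEM SET log_statement_sample_rate = 0.1;
SELECT pg_reload_conf();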

Regards,



Re: Suggestion about logging only every n-th statement

2018-06-20 Thread Adrien Nayrat
On 06/20/2018 03:08 PM, Andreas Kretschmer wrote:
> I would suggest using of pg_stat_statements for that.

I agree, but sometimes it is not sufficient, for example when you need the
query's parameters.
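
A quick illustration of that limitation (assuming the pg_stat_statements
extension is installed and preloaded):

SELECT calls, query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 5;
-- "query" is normalized: constants are replaced by placeholders such as $1,
-- so the actual parameter values are not available here, unlike in the logs.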

-- 
Adrien NAYRAT






Re: Suggestion about logging only every n-th statement

2018-06-24 Thread Adrien Nayrat
On 06/20/2018 03:06 PM, Janning Vygen wrote:
>> FYI, I made this patch which seems to do what you want:
>> https://www.postgresql.org/message-id/flat/c30ee535-ee1e-db9f-fa97-146b9f62caed%40anayrat.info#c30ee535-ee1e-db9f-fa97-146b9f62c...@anayrat.info
>>
>>
>> I will add an entry in september's commit fest.
> 
> excellent. Exactly what I want!

FYI, I added an entry in the next commitfest (September's is not open yet):
https://commitfest.postgresql.org/18/1691/

Feel free to review it ;)

Regards,

-- 
Adrien NAYRAT





Re: Using COPY to import large xml file

2018-06-24 Thread Adrien Nayrat
On 06/24/2018 05:25 PM, Anto Aravinth wrote:
> Hello Everyone,
> 
> I have downloaded the Stackoverflow posts XML (contains all SO questions to
> date). The file is around 70GB. I want to import the data from those XML files
> into my table. Is there a way to do so in Postgres?
> 
> 
> Thanks, 
> Anto.

Hello Anto,

I used this tool:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres

Regards,

-- 
Adrien NAYRAT
https://blog.anayrat.info





Re: Using COPY to import large xml file

2018-06-24 Thread Adrien Nayrat
On 06/24/2018 06:07 PM, Anto Aravinth wrote:
> Thanks for the response. I'm not sure how long this tool takes for the
> 70GB data. 

From memory, it took several hours. I can't remember whether the XML conversion
or the inserts took longer.

> 
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took 
> around
> 2 hrs using nodejs. Hence, I thought will use COPY command, as suggested on 
> the
> internet. 
> 
> Definitely, will try the code and let you know.. But looks like it uses the 
> same
> INSERT, not copy.. interesting if it runs quick on my machine. 

Yes, it uses INSERT; it's probably not difficult to change the code to use COPY
instead.
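
For the record, a rough sketch of what a COPY-based load could look like once
the XML has been converted to a delimited file (the table and file names below
are made up for illustration):

-- Hypothetical target table; adjust to the real Posts schema.
CREATE TABLE posts (
    id         integer PRIMARY KEY,
    post_type  integer,
    body       text,
    created_at timestamptz
);

-- Server-side COPY from a pre-converted CSV (use \copy from psql if the
-- file lives on the client side).
COPY posts (id, post_type, body, created_at)
FROM '/tmp/posts.csv' WITH (FORMAT csv, HEADER true);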

-- 
Adrien NAYRAT
https://blog.anayrat.info





Re: Sv: Re: CTE optimization fence

2018-06-27 Thread Adrien NAYRAT

On 06/27/2018 09:58 AM, Andreas Joseph Krogh wrote:

 > but we have to settle on a way of controlling it.

+1 from me.

I am running more and more into situations where people consider
this a bug rather than a feature.

FWIW, I think a GUC that switches between the current (mostly
unwanted, at least surprising)
way and one where the CTE is optimized together with the main query
would suit "most" people.

For sake of compatibility this could default to the current behaviour

+1 from me. The default should be "no fence" for sake of least surprise 
I think. Documenting the change would be sufficient.

I hope this will be picked up in the coming v12 cycle.



FYI, this subject has been discussed in this thread: 
https://www.postgresql.org/message-id/5351711493487900%40web53g.yandex.ru
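
To illustrate what is being discussed (big_table and id are placeholder names;
the per-query MATERIALIZED / NOT MATERIALIZED keywords shown are what was
eventually adopted in v12):

-- Before v12, a CTE is always materialized, so the filter below is not
-- pushed down into the CTE:
WITH w AS (SELECT * FROM big_table)
SELECT * FROM w WHERE id = 42;

-- v12 made the fence controllable per query:
WITH w AS MATERIALIZED (SELECT * FROM big_table)      -- keep the fence
SELECT * FROM w WHERE id = 42;

WITH w AS NOT MATERIALIZED (SELECT * FROM big_table)  -- allow inlining
SELECT * FROM w WHERE id = 42;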


Regards,



Re: Query to monitor index bloat

2018-07-16 Thread Adrien NAYRAT

On 07/16/2018 05:16 PM, Alessandro Aste wrote:
Hello,  I am trying to put together a query to monitor the index bloat 
for a database I maintain.
Is there a "SQL" way to obtain bloated indexes? I googled around but I 
found nothing that works.


I'm currently running 9.6 but I'm looking for something compatible with 
version 10 too.


Thank you very much in advance,


Alessandro.


Hello,

You should look at: https://github.com/ioguix/pgsql-bloat-estimation

Regards,



Re: Query to monitor index bloat

2018-07-16 Thread Adrien Nayrat
On 07/16/2018 11:50 PM, Alessandro Aste wrote:
> Thanks much, I'll check that out.  I see the queries are 3 years old so I'm
> wondering if they still work for 9.6.x or 10

AFAIK they still work. You can check for yourself by comparing the reported
bloated table/index size before and after a VACUUM FULL or REINDEX ;)
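
For a single index, a quick sanity check looks like this (the index name is a
placeholder):

-- Compare the physical size before and after rebuilding the index;
-- the difference should roughly match the estimated bloat.
SELECT pg_size_pretty(pg_relation_size('my_index'));
REINDEX INDEX my_index;
SELECT pg_size_pretty(pg_relation_size('my_index'));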






Re: Shared buffers increased but cache hit ratio is still 85%

2018-07-19 Thread Adrien NAYRAT

On 07/18/2018 10:26 AM, Hans Schou wrote:

Am I doing something wrong or should some history be cleared?


Hi,

FYI, check_pgactivity saves the diff between each call to compute the 
real hit ratio: https://github.com/OPMDG/check_pgactivity
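
The underlying point is that the counters in pg_stat_database are cumulative
since the last stats reset, so a ratio computed directly from them reflects the
whole history rather than the current behaviour. A rough sketch of the raw
computation:

-- Cumulative hit ratio since the last stats reset; to get the *current*
-- ratio, sample this twice and compute the ratio of the deltas.
SELECT datname,
       blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cumulative_hit_ratio,
       stats_reset
FROM pg_stat_database
WHERE datname = current_database();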


Regards,



Re: Postgres - search for value throughout many tables?

2018-08-10 Thread Adrien NAYRAT

On 08/08/2018 04:09 PM, czezz wrote:

Hi everyone,
I want to ask if anyone knows whether there is a way to search for a specific 
"value" throughout a list of tables OR all tables in a database?


Cheers,
czezz


Hello,

Maybe this article by Daniel could help you: 
https://blog-postgresql.verite.pro/2017/06/06/global-search.html


Unfortunately it is not translated; you will have to use a translation tool.
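
In the meantime, here is a rough plpgsql sketch of the brute-force approach: it
only looks at text-like columns, only matches exact values, and makes no claim
of efficiency (the function name is made up):

CREATE OR REPLACE FUNCTION search_everywhere(needle text)
RETURNS TABLE(tbl regclass, col name) AS $$
DECLARE
    r record;
    found boolean;
BEGIN
    -- Every text-like column of every ordinary table outside the catalogs.
    FOR r IN
        SELECT c.oid::regclass AS tbl, a.attname AS col
          FROM pg_class c
          JOIN pg_attribute a ON a.attrelid = c.oid
         WHERE c.relkind = 'r'
           AND a.attnum > 0
           AND NOT a.attisdropped
           AND a.atttypid IN ('text'::regtype, 'varchar'::regtype, 'bpchar'::regtype)
           AND c.relnamespace NOT IN ('pg_catalog'::regnamespace,
                                      'information_schema'::regnamespace)
    LOOP
        -- Sequential scan on each candidate column, stopping at the first hit.
        EXECUTE format('SELECT EXISTS (SELECT 1 FROM %s WHERE %I = $1)',
                       r.tbl, r.col)
           INTO found
          USING needle;
        IF found THEN
            tbl := r.tbl;
            col := r.col;
            RETURN NEXT;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Usage: SELECT * FROM search_everywhere('some value');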

Regards,



Re: Fwd: Postgres HA - pacemaker RA do not support auto failback

2019-08-12 Thread Adrien Nayrat
On 8/12/19 2:57 PM, Shital A wrote:
> 
> 
> Hello,
> 
> Postgres version : 9.6
> OS:Rhel 7.6
> 
> We are working on HA setup for postgres cluster of two nodes in active-passive
> mode. 
> 
> Installed: 
> Pacemaker 1.1.19
> Corosync 2.4.3
> 
> The pacemaker agent with this installation doesn't support automatic failback.
> What I mean by that is explained below:
> 1. Cluster is setup like A - B with A as master. 
> 2. Kill services on A, node B will come up as master.
> 3. node A is ready to join the cluster, we have to delete the lock file it
> creates on any one of the node and execute the cleanup command to get the node
> back as standby
> 
> Step 3 is manual so HA is not achieved in real sense. 
> 
> Please help to check:
> 1. Is there any version of the resource agent which supports automatic 
> failback?
> To avoid generation of lock file and deleting it.
> 
> 2. If there is no such support, what checks should be added in pgsql RA to 
> achieve 
> 
> Please suggest.
> Thanks.
> 

Hello,

Which RA did you use? AFAIK there are two RAs:
- pgsql:
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
- PAF: http://clusterlabs.github.io/PAF/ (which is different from the pgsql RA:
http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)

If I am right, PAF should handle the case where there is no need to perform
pg_rewind (i.e. when the old master is not too far ahead in the transaction log).

Regards,

-- 
Adrien NAYRAT
https://blog.anayrat.info





PITR and instance without any activity

2022-12-12 Thread Adrien Nayrat

Hello,

I came across a weird problem with PITR using a recovery_target_time on 
a cluster without any activity. The issue is quite simple: we just try 
to do a PITR by specifying a recovery_target_time after the backup.


The main issue is that there is no timestamp in the WAL, as there is no 
activity. Postgres fetches record timestamps with 
getRecordTimestamp:


/*
 * Extract timestamp from WAL record.
 *
 * If the record contains a timestamp, returns true, and saves the timestamp
 * in *recordXtime. If the record type has no timestamp, returns false.
 * Currently, only transaction commit/abort records and restore points contain
 * timestamps.
 */



So Postgres tries to replay all the WAL until the end, and we get this 
message:

FATAL:  recovery ended before configured recovery target was reached

Before 13, promotion was done without any error:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dc788668bb269b10a108e87d14fefd1b9301b793

I wonder if we could add a timestamp to other WAL records? Maybe when we 
switch xlog?


I may have spotted a bug: I tried to create a restore point, expecting Postgres 
to pick up its timestamp as described in the getRecordTimestamp comment.
But Postgres did not find any timestamp and kept replaying the WAL even 
after the restore point.
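
As a workaround for the time-based target, a restore point can at least be
targeted by name, which does not rely on commit timestamps being present in the
WAL (the label below is only an example):

-- On the primary, before the point you may want to come back to:
SELECT pg_create_restore_point('before_maintenance');

-- Then in the restore configuration, target the named point instead of a time:
-- recovery_target_name = 'before_maintenance'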


FYI, it seems I am not alone with this issue:
https://github.com/pgbackrest/pgbackrest/issues/1934
https://github.com/pgbackrest/pgbackrest/issues/1526#issuecomment-937759691

Thanks

--
Adrien NAYRAT





Re: PITR and instance without any activity

2023-01-09 Thread Adrien Nayrat

Hello,

I am bumping this thread, hoping for more reactions :)

TL;DR:

* Since 13, we can't perform PITR on a cluster without any activity.
* It seems creating a restore point doesn't record a timestamp in the WAL.

Thanks

--
Adrien NAYRAT






Re: PITR and instance without any activity

2023-01-11 Thread Adrien Nayrat

On 1/9/23 11:23, Torsten Förtsch wrote:
On Mon, Jan 9, 2023 at 10:59 AM Adrien Nayrat 
<adrien.nay...@anayrat.info> wrote:



* We can't perform PITR on a cluster without any activity since 13
* It seems creating restore point doesn't record a timestamp in wal.

I have a cron job that runs this every 5 minutes:

SELECT txid_current()
WHERE (pg_last_committed_xact()).timestamp + '5min'::INTERVAL < now()


  Combine that with a suitable value for archive_timeout.


Hello,

I don't get how that would be useful?
When we perform PITR, we don't know whether there was activity or not.

--
Adrien NAYRAT






Re: ERROR: found multixact from before relminmxid

2018-11-05 Thread Adrien NAYRAT

On 6/8/18 8:30 PM, Jeremy Finzel wrote:


  No I was referring to this from the documentation:

Avoid spuriously marking pages as all-visible (Dan Wood, Pavan
Deolasee, Álvaro Herrera)

This could happen if some tuples were locked (but not deleted).
While queries would still function correctly, vacuum would normally
ignore such pages, with the long-term effect that the tuples were
never frozen. In recent releases this would eventually result in
errors such as "found multixact n from before relminmxid n".




Hello hackers,

We got the same issue on a 9.6.10: autovacuum reports the same error 
when it tries to freeze a table:

ERROR:  found multixact 37009963 from before relminmxid 48606990

Autovacuum was not able to freeze this table, and Postgres eventually got to 
within 1 million transactions of wraparound and refused any new transactions.


We have an OLTP workload and I noticed queries like SELECT ... FOR SHARE. 
I checked the durability settings (fsync, full_page_writes); everything seems 
fine. Also, I did not notice any errors on the storage (local SSD with a RAID 
controller).


We "solved" with a dump/restore on another server, also we kept previous 
cluster to investigate. I want to be sure we encountered the bug solved 
in 9.6.9 and it is not a new one.


If we confirm it, maybe we should advise users to perform an integrity 
check? I was surprised that 9.6.9 prevents new occurrences of the corruption but 
provides nothing to check whether a database is already corrupted.


FYI, we tried to do a pg_dump on the secondary:
pg_dump: Error message from server: ERROR:  could not access status of 
transaction 1323321209

DETAIL:  Could not open file "pg_clog/04EE": No such file or directory.

I am surprised this clog file is missing on the secondary (but present 
on the primary):

[...]
04EB
04EC
04ED
0CEE
0CEF
0CF0
0CF1
[...]

I ran pg_visibility's checks; they do not reveal any corruption:
select pg_check_visible(i)
  from (select nspname || '.' || relname as i
          from pg_class
          join pg_namespace on pg_class.relnamespace = pg_namespace.oid
         WHERE relkind IN ('r','t','m')) as i;

(0 rows)

select pg_check_frozen(i)
  from (select nspname || '.' || relname as i
          from pg_class
          join pg_namespace on pg_class.relnamespace = pg_namespace.oid
         WHERE relkind IN ('r','t','m')) as i;

(0 rows)

Here is the pg_controldata output and information on the table where autovacuum 
reports the error:

/usr/pgsql-9.6/bin/pg_controldata
pg_control version number: 960
Catalog version number: 201608131
Database system identifier: 6451990604639161176
Database cluster state: in production
pg_control last modified: Mon 05 Nov 2018 12:44:30 PM CET
Latest checkpoint location: B9AF/70A4FD98
Prior checkpoint location: B9AF/70A4B7D0
Latest checkpoint's REDO location: B9AF/70A4FD60
Latest checkpoint's REDO WAL file: 0001B9AF0070
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 1:1323325854
Latest checkpoint's NextOID: 1665327960
Latest checkpoint's NextMultiXactId: 60149587
Latest checkpoint's NextMultiOffset: 174269996
Latest checkpoint's oldestXID: 3471809503
Latest checkpoint's oldestXID's DB: 16393
Latest checkpoint's oldestActiveXID: 1323325854
Latest checkpoint's oldestMultiXid: 48606990
Latest checkpoint's oldestMulti's DB: 16393
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Mon 05 Nov 2018 12:44:29 PM CET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: off
max_connections setting: 750
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0

SELECT * FROM pg_class WHERE oid = 'anonymised'::regclass;
-[ RECORD 1 
]---+---

relname | a

Re: ERROR: found multixact from before relminmxid

2018-11-07 Thread Adrien NAYRAT

On 11/5/18 5:41 PM, Adrien NAYRAT wrote:
We "solved" with a dump/restore on another server, also we kept previous 
cluster to investigate. I want to be sure we encountered the bug solved 
in 9.6.9 and it is not a new one.


FYI, I am not sure I will be able to keep the previous cluster for many days.



If we confirm it, maybe we should advise users to perform an integrity 
check? I was surprised that 9.6.9 prevents new occurrences of the corruption but 
provides nothing to check whether a database is already corrupted.


If I understand correctly, it seems related to vacuum skipping pages that are 
already marked all-frozen. Maybe we should advise doing a VACUUM FREEZE with the 
DISABLE_PAGE_SKIPPING option?
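
Something along these lines, per affected table (the table name is a
placeholder):

-- Force vacuum to visit every page, ignoring the visibility map,
-- and freeze everything it can.
VACUUM (FREEZE, DISABLE_PAGE_SKIPPING, VERBOSE) my_table;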


Regards,



Re: ERROR: found multixact from before relminmxid

2018-11-09 Thread Adrien NAYRAT

On 11/7/18 1:21 PM, Alexandre Arruda wrote:
The best solution that I have found is to kick all connections and execute 
a SELECT ... FOR UPDATE (discarding the output) on the affected tables, i.e.:


psql -o /dev/null -c "select * from table for update" database

After this, the vacuum is executed w/o problems again.

Best regards,

Alexandre


Thanks Alexandre, I will try.



Re: pg_dump out of memory for large table with LOB

2018-11-11 Thread Adrien Nayrat
Hello,

On 11/10/18 12:49 AM, Jean-Marc Lessard wrote:
> The dumped table is 0.5TB, 17 million rows and LOB uses about 99% of the 
> space.
> 

If I understand correctly, you have 17 million large objects?

I do not recall exactly and maybe I am wrong, but it seems pg_dump has to
allocate memory for each object to dump. In addBoundaryDependencies:

for (i = 0; i < numObjs; i++)
[...]

case DO_BLOB_DATA:
/* Data objects: must come between the boundaries */
addObjectDependency(dobj, preDataBound->dumpId);
addObjectDependency(postDataBound, dobj->dumpId);
break;

addObjectDependency:

[...]
pg_malloc(dobj->allocDeps * sizeof(DumpId));


With 17 million large objects, that could eat a lot of memory ;)
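
A quick way to see how many large objects the dump has to track (each one
becomes its own entry in pg_dump's internal tables):

-- Number of large objects in the database.
SELECT count(*) FROM pg_largeobject_metadata;

-- Total size held in the large-object store, for reference.
SELECT pg_size_pretty(pg_total_relation_size('pg_largeobject'));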





Re: Snowball Stemming Greek support not included in Release 12 beta

2019-06-07 Thread Adrien Nayrat
On 6/7/19 10:31 AM, Tasos wrote:
> Hi to all,
>  
> Since Oct 2018, the Greek stemming algorithm contributed by Oleg Smirnov is
> included in the Snowball project:
> https://www.postgresql.org/message-id/E1g4YQa-0003tf-Gr%40gemulon.postgresql.org
>  
> I believe it can be included in Release 12 of Postgres.
>  
> Best Regards,
> Bard

Hi,

Postgres' stemmer has been synced with Snowball upstream, but it was before 
October:

https://www.postgresql.org/message-id/E1g4YQa-0003tf-Gr%40gemulon.postgresql.org

As mentioned there, that sync added Arabic, Indonesian, Irish, Lithuanian,
Nepali, and Tamil.

We are now at 12 beta 1 and well past feature freeze, so I am afraid your
request to add Greek can only be considered for v13. Sorry.

Regards,

-- 
Adrien





Re: Snowball Stemming Greek support not included in Release 12 beta

2019-06-07 Thread Adrien Nayrat
On 6/7/19 12:19 PM, Tasks wrote:
> 😢 Is there a solution to that? 
> Is it possible for us to add it ourselves?
> 
> Has anyone worked with this extension -> 
> https://github.com/postgrespro/snowball_ext
> 
> Tassos
> 


I have not played with this extension; maybe you can add Greek to
https://github.com/postgrespro/snowball_ext/tree/master/libstemmer

You should ask the authors of this extension ;)
https://github.com/postgrespro/snowball_ext#authors

-- 
Adrien





Re: Statistics tables not being updated anymore

2019-07-02 Thread Adrien Nayrat
Hello,

FYI, check_pgactivity has a service to detect a frozen stats collector:
https://github.com/OPMDG/check_pgactivity#user-content-stat_snapshot_age-9.5

We added this service after a customer deactivated IPv6, which broke the stats
collector.
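
Roughly, the idea is to watch the age of the statistics snapshot; a minimal
check along these lines (my assumption of how the service works, not a copy of
it):

-- Age of the current statistics snapshot; if the stats collector is
-- frozen, this keeps growing instead of staying close to zero.
SELECT now() - pg_stat_get_snapshot_timestamp() AS stats_snapshot_age;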

Regards,





Re: "invalid contrecord" error on replica

2023-12-21 Thread Adrien Nayrat

I received a question about this error.
Just for the record, in case someone encounters the same issue:
it has been fixed in ff9f111bce24.

TL;DR: update your instance :)

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24

 Fix WAL replay in presence of an incomplete record

Physical replication always ships WAL segment files to replicas once
they are complete.  This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
  LOG:  invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be.  A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary.  But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.

A fix for this problem was already attempted in commit 515e3d84a0b5, but
it only addressed the case for the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.

This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts.  With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.

Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.

A new TAP test that exercises this is added, but the portability of it
is yet to be seen.

This has been wrong since the introduction of physical replication, so
backpatch all the way back.  In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.




Re: "invalid contrecord" error on replica

2021-05-02 Thread Adrien Nayrat
tot): 76/76, tx:   55621935, lsn: 
AA/A0FFFB20, prev AA/A0FFFAF8, desc: HOT_UPDATE off 83 xmax 55621935 flags 0x20 
; new off 19 xmax 0, blkref #0: rel 1663/167087/167363 blk 2662


Fortunately I archive WAL, and on the primary the end of this WAL file is:
rmgr: Transaction len (rec/tot): 34/34, tx:   55621934, lsn: 
AA/A0FFFAF8, prev AA/A0FFE7F8, desc: COMMIT 2021-05-01 17:25:51.949124 CEST
rmgr: Heaplen (rec/tot): 76/76, tx:   55621935, lsn: 
AA/A0FFFB20, prev AA/A0FFFAF8, desc: HOT_UPDATE off 83 xmax 55621935 flags 0x20 
; new off 19 xmax 0, blkref #0: rel 1663/167087/167363 blk 2662
rmgr: XLOGlen (rec/tot):114/   114, tx:  0, lsn: 
AA/A0FFFB70, prev AA/A0FFFB20, desc: CHECKPOINT_SHUTDOWN redo AA/A0FFFB70; tli 
1; prev tli 1; fpw true; xid 0:55621936; oid 1401408; multi 401; offset 802; 
oldest xid 478 in DB 13397; oldest multi 1 in DB 13397; oldest/newest commit 
timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: Standby len (rec/tot): 50/50, tx:  0, lsn: 
AA/A0FFFBE8, prev AA/A0FFFB70, desc: RUNNING_XACTS nextXid 55621936 
latestCompletedXid 55621935 oldestRunningXid 55621936


I also dumped 000100AA00A1 on the secondary and it contains all the 
records until AA/A1004018.


It is really weird; I don't understand how the secondary can miss the last two 
records of A0. It seems it did not receive the CHECKPOINT_SHUTDOWN record?


Any idea?

Thanks!



--
Adrien NAYRAT





Re: "invalid contrecord" error on replica

2021-05-03 Thread Adrien Nayrat

On 03/05/2021 10:43, Laurenz Albe wrote:

On Sun, 2021-05-02 at 22:43 +0200, Adrien Nayrat wrote:

LOG:  started streaming WAL from primary at AA/A100 on timeline 1
FATAL:  could not receive data from WAL stream : ERROR:  requested starting
point AA/A100 is ahead of the WAL flush position of this server AA/A0FFFBE8

You are trying to stream from a server whose WAL position is *behind* your
own.  That cannot work.

Yours,
Laurenz Albe


Yes, it was during the primary's recovery: as it crashed, it restarted from the 
previous checkpoint.


--
Adrien NAYRAT





Re: "invalid contrecord" error on replica

2021-05-03 Thread Adrien Nayrat
Oh, I forgot to mention that I was able to recover the secondary by copying 
000100AA00A0 from the archives into pg_wal. Then the secondary was 
able to finish recovery, start streaming replication and fetch the subsequent WALs.


I wondered why there was a CHECKPOINT_SHUTDOWN record, so I dug a little more:

First, the filesystem on the primary was full and I got:

PANIC: could not write to file "pg_wal/xlogtemp.305": No space left on device

The instance crashed and restarted in recovery mode.
At the end of the recovery I got:
checkpoint starting: end-of-recovery immediate
checkpoint complete: ...

Then a FATAL message:
FATAL: could not write to file "pg_wal/xlogtemp.9405": No space left on device

Followed by:
aborting startup due to process failure

Maybe it is this checkpoint which was not replicated? The primary had enough 
space for this record. But I don't understand how the secondary received records 
beginning at AA/A1...


I googled this and found other similar issues:
https://www.postgresql.org/message-id/flat/15938-8591df7e95064538%40postgresql.org


https://www.postgresql.org/message-id/CAMp7vw97871F21X7FHHdmU2FXGME4HTgMYxkAubMdCU2xevmxQ%40mail.gmail.com
https://www.postgresql.org/message-id/flat/E73F4CFB-E322-461E-B1EC-82FAA808FEE6%40lifetrenz.com

https://www.postgresql.org/message-id/15398-b4896eebf0bed218%40postgresql.org
https://www.postgresql.org/message-id/flat/15412-f9a89b026e6774d1%40postgresql.org

--
Adrien NAYRAT
https://blog.anayrat.info




Re: "invalid contrecord" error on replica

2021-05-06 Thread Adrien Nayrat

On 5/6/21 7:37 AM, Kyotaro Horiguchi wrote:

At Sun, 2 May 2021 22:43:44 +0200, Adrien Nayrat  
wrote in

I also dumped 000100AA00A1 on the secondary and it
contains all the records until AA/A1004018.

It is really weird, I don't understand how the secondary can miss the
last 2 records of A0? It seems he did not received the
CHECKPOINT_SHUTDOWN record?

Any idea?


This seems like stepping on the same issue with [1], in short, the
secondary having received an incomplete record but the primary forgot
of the record after restart.

Specifically, primary was writing a WAL record that starts at A0FFFB70
and continues to A1xx segment. The secondary successfully received
the first half of the record but the primary failed to write (then
send) the last half of the record due to disk full.

At this time it seems that the primary's last completed record ended
at A0FFB70. Then the CHECKPOINT_SHUTDOWN record overwrote the
already-halfly-sent record up to A0FFBE8 while restarting.

On the secondary side, there's only the first half of the record,
which had been forgotten by the primary and the last half starting at
LSN A100 was still the future in the new history on the primary.

After some time the primary reaches A100 but the first record in
the segment is of course disagrees with the history of the secondary.

1: 
https://www.postgresql.org/message-id/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com

regards.



Hello,

Thanks for your reply and your explanation! Now I understand; it's good to know 
it is a known issue.
I'll follow this thread and hope we will find a solution. It's annoying that your 
secondary breaks when your primary crashes and the only fix is to either 
fetch an archived WAL file and replace it on the secondary, or completely 
rebuild the secondary.


Thanks





--
Adrien NAYRAT