>>> damiano giuliani <[email protected]> wrote on 13.07.2021 at 13:42
>>> in message
>>> <CAG=zYNPzXNFEMxR1dZdzhKT122=_mrgqelk8rnypxgh7y29...@mail.gmail.com>:
> Hi guys,
> I'm back with some PAF postgres cluster problems.
> Last night the cluster fenced the master node and promoted the PAF resource
> on a new node.
> Everything went fine, except that I really don't know why it happened.
> This morning I noticed the old master had been fenced by sbd and a new
> master had been promoted; this happened last night at 00.40.XX.
> Filtering the logs, I can't find any reason why the old master was fenced
> and the promotion of the new master started (which seems to have gone
> perfectly). At this point I'm a bit lost, because none of us is able to
> find the real reason.
> The cluster worked flawlessly for days with no issues, until now.
> It is crucial for me to understand why this switch occurred.
>
> I attached the current status, configuration, and logs.
> In the old master node's log I can't find any reason;
> on the new master the only things are the fencing and the promotion.
>
>
> PS:
> Could this be the reason for the fencing?

First, I think your timeouts are rather aggressive. I hope there are no virtual machines involved.

Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] Failed to receive the leave message. failed: 1

This may be a networking problem, or the other node died for some unknown reason. That's the reason for fencing.

Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Our peer on the DC (ltaoperdbs02) is dead
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Node ltaoperdbs02 is unclean

You said there is no reason for fencing, but here it is!

Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Scheduling Node ltaoperdbs02 for STONITH
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice: * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'

The fencing timing is also quite aggressive IMHO.

Could it be that a command saturated the network?

Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;

Regards,
Ulrich

>
> grep -e sbd /var/log/messages
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
>
> Any thoughts and help are really appreciated.
>
> Damiano
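
For anyone chasing a similar fencing event, it can help to pull only the cluster-related lines from a narrow time window out of the syslog, instead of grepping for a single daemon. The following is only a rough sketch, not a tool from this thread: the function names are made up for illustration, the patterns are simply taken from the log lines quoted above, and it assumes classic syslog timestamps ("Jul 13 00:40:37") with English month names and a year you supply yourself.

```python
import re
from datetime import datetime

# Patterns copied from the messages quoted in this thread; extend as needed.
FENCING_PATTERNS = re.compile(
    r"TOTEM|is unclean|STONITH|Fence \(|is dead|sbd\[", re.IGNORECASE
)


def parse_syslog_time(line, year=2021):
    """Parse the leading 'Jul 13 00:40:37' stamp; return None if it is absent.

    Classic syslog lines carry no year, so the caller has to provide one.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def fencing_events(lines, start, end):
    """Return the lines stamped within [start, end] that match a pattern."""
    hits = []
    for line in lines:
        stamp = parse_syslog_time(line)
        if stamp is None or not (start <= stamp <= end):
            continue
        if FENCING_PATTERNS.search(line):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Hypothetical usage: look at the ten minutes around the fencing.
    with open("/var/log/messages", errors="replace") as log:
        window = fencing_events(
            log,
            datetime(2021, 7, 13, 0, 35),
            datetime(2021, 7, 13, 0, 45),
        )
    print("\n".join(window))
```

Running the same filter on both nodes and comparing the output side by side usually shows which node noticed the membership loss first.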
