Re: [ClusterLabs] Automatic recover from split brain ?

Adam Cécile Tue, 11 Aug 2020 00:35:10 -0700

On 8/11/20 8:48 AM, Andrei Borzenkov wrote:

08.08.2020 13:10, Adam Cécile wrote:

Hello,



I'm experiencing an issue with corosync/pacemaker running on Debian Buster.
The cluster has three nodes running in VMware virtual machines, and the
cluster fails when VEEAM backs up a virtual machine (I know it's doing
bad things, like completely freezing the VM for a few minutes to take a
disk snapshot).

My biggest issue is that once the backup has completed, the cluster
stays in a split-brain state, and I'd like it to heal itself. Here is the
current status:


One node is isolated:

Stack: corosync
Current DC: host2.domain.com (version 2.0.1-9e909a5bdd) - partition
WITHOUT quorum
Last updated: Sat Aug  8 11:59:46 2020
Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on
host1.domain.com

3 nodes configured
6 resources configured

Online: [ host2.domain.com ]
OFFLINE: [ host3.domain.com host1.domain.com ]


The two others see each other:

Stack: corosync
Current DC: host3.domain.com (version 2.0.1-9e909a5bdd) - partition with
quorum
Last updated: Sat Aug  8 12:07:56 2020
Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on
host1.domain.com

3 nodes configured
6 resources configured

Online: [ host3.domain.com host1.domain.com ]
OFFLINE: [ host2.domain.com ]

Show your full configuration including defined STONITH resources and
cluster options (most importantly, no-quorum-policy and stonith-enabled).


Hello,

Stonith is disabled and I tried various settings for no-quorum-policy.

The problem is that one of the resources is a floating IP address which
is currently assigned to two different hosts...

Of course - each partition assumes the other partition is dead and so it
is free to take over the remaining resources.

I understand that, but I still don't get why, once all nodes are back online, the cluster does not recover from resources running on multiple hosts.

Can you help me configure the cluster correctly so this cannot occur?

Define "correctly".

The most straightforward textbook answer - you need to have STONITH
resources that will eliminate the "lost" node. But your lost node is in the
middle of performing a backup. Eliminating it may invalidate the backup
being created.

Yeah but well, no. Killing the node is worse: sensitive services are already running in clustering mode at the application level, so they do not rely on corosync. Basically corosync is providing a floating IP for some external non-critical access and starting systemd timers that are pointless to run on multiple hosts. Nothing critical here.


So another answer would be - put the cluster in maintenance mode, perform
the backup, resume normal operation. Usually backup software allows hooks
to be executed before and after the backup. It may work too.
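
A minimal sketch of such hooks, assuming the backup software can run a script on one cluster node before and after the job. The function names and the DRY_RUN switch are hypothetical; crm_attribute is Pacemaker's own tool for setting cluster properties such as maintenance-mode.

```shell
#!/bin/sh
# Hypothetical pre/post backup hook; how it gets called by the backup
# job is an assumption. Set DRY_RUN=1 to print the command instead of
# running it.

set_maintenance() {
    # $1 is "true" (enter maintenance mode) or "false" (leave it)
    cmd="crm_attribute --type crm_config --name maintenance-mode --update $1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

pre_backup()  { set_maintenance true; }
post_backup() { set_maintenance false; }

# Dispatch on the first argument: "pre" before the backup, "post" after.
case "${1:-}" in
    pre)  pre_backup ;;
    post) post_backup ;;
esac
```

While maintenance-mode is true, Pacemaker leaves resources where they are and does not start or stop them, so a frozen node should not trigger a takeover of the floating IP.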

This is indeed something I might look at, but again, for my trivial needs it sounds a bit overkill to me.

Or find a way to not freeze the VM during backup ... e.g. by using a
different backup method?

Or tweak some network settings so corosync does not consider the node dead too soon? The backup won't last more than 2 minutes and the freeze is usually way below that. I can definitely live with the cluster state being unknown for a couple of minutes. Is that possible?
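
On the corosync side, the setting that governs how long a silent node is tolerated appears to be the totem token timeout in corosync.conf, expressed in milliseconds. A hedged fragment, where the 30-second value is purely an illustrative assumption: the right value depends on how long the freeze really lasts, and a longer token also delays detection of real failures.

```
totem {
    # keep the existing totem settings (version, cluster_name, ...) as they are
    # token timeout in milliseconds; 30000 is an illustrative assumption
    token: 30000
}
```

The file would need to be identical on all three nodes, and corosync has to reload or restart before the new value applies.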

Removing VEEAM is indeed my last option and the one I have used so far, but this time I was hoping someone else would be experiencing the same issue and could help me fix it in a clean way.



Thanks

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


