On Tue, 7 Apr 2020 14:13:35 -0400 Sherrard Burton <[email protected]> wrote:
> On 4/7/20 1:16 PM, Andrei Borzenkov wrote:
> > 07.04.2020 00:21, Sherrard Burton wrote:
> >>>
> >>> It looks like some timing issue or race condition. After reboot node
> >>> manages to contact qnetd first, before connection to other node is
> >>> established. Qnetd behaves as documented - it sees two equal size
> >>> partitions and favors the partition that includes tie breaker (lowest
> >>> node id). So existing node goes out of quorum. Second later both nodes
> >>> see each other and so quorum is regained.
> >>
> >
> > Define the right problem to solve?
> >
> > Educated guess is that your problem is not corosync but pacemaker
> > stopping resources. In this case just do what was done for years in two
> > node cluster - set no-quorum-policy=ignore and rely on stonith to
> > resolve split brain.
> >
> > I dropped idea to use qdevice in two node cluster. If you have reliable
> > stonith device it is not needed and without stonith relying on watchdog
> > suicide has too many problems.
>
> Andrei,
> in a two-node cluster with stonith only, but no qdevice, how do you
> avoid the dreaded stonith death match, and the resultant flip-flopping
> of services?

In my understanding, two_node and wait_for_all should avoid this. After node A
has been fenced, node B keeps quorum thanks to two_node. When A comes back, as
long as it is not able to join the corosync group, it will not be quorate,
thanks to wait_for_all. No quorum, no fencing allowed.

But the best protection is to disable pacemaker on boot, so an admin can
investigate the situation and join the node back safely. (For reference,
config sketches for the options discussed in this thread follow below.)

Regards,
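
A minimal sketch of the corosync.conf quorum section with these options, just
for illustration; adapt it to your own corosync.conf, and note that per the
votequorum(5) man page two_node already enables wait_for_all by default:

    quorum {
        provider: corosync_votequorum
        # two_node: declare a deliberate two-node cluster so the surviving
        # node keeps quorum after its peer is fenced
        two_node: 1
        # implied by two_node, shown explicitly for clarity: a rebooted node
        # stays inquorate until it has seen the other node at least once
        wait_for_all: 1
    }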
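
And a sketch of keeping the cluster stack from starting on boot, assuming the
usual systemd unit names (pacemaker, corosync); pcs users can get the same
effect with "pcs cluster disable":

    # on each node, stop pacemaker (and optionally corosync) from starting at boot
    systemctl disable pacemaker
    systemctl disable corosync

    # equivalent with pcs, applied to every node of the cluster
    pcs cluster disable --all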
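
For completeness, Andrei's suggestion above maps to a single pacemaker cluster
property; a pcs example (crmsh users would use "crm configure property ..."
instead):

    # ignore loss of quorum and keep resources running; only safe when
    # working stonith resolves the split brain, as Andrei notes
    pcs property set no-quorum-policy=ignore
    # make sure fencing really is enabled
    pcs property set stonith-enabled=true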
