Hi ClusterLabs,

I have a query about safely removing a node from a corosync cluster.

When "corosync-cfgtool -R" is issued, all nodes reload their configuration from corosync.conf. If I have removed a node from the nodelist but corosync is still running on that node, it receives the reload signal but carries on as if nothing had happened. This then causes problems on all nodes.
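(For concreteness, by "removed a node from the nodelist" I mean deleting its node {} block from corosync.conf — e.g. going from

  nodelist {
      node {
          ring0_addr: 10.71.217.70
          nodeid: 1
      }
      node {
          ring0_addr: 10.71.217.71
          nodeid: 2
      }
  }

to a nodelist containing only the nodeid 2 entry. The fragment is illustrative, not my exact config.)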

A specific example:

I have a running cluster containing two nodes: 10.71.217.70 (nodeid=1) and 10.71.217.71 (nodeid=2). When I remove node 1 from the nodelist in corosync.conf on both nodes then issue "corosync-cfgtool -R" on 10.71.217.71, I see this on 10.71.217.70:

  Quorum information
  ------------------
  Date:             Fri Oct 20 13:23:02 2017
  Quorum provider:  corosync_votequorum
  Nodes:            2
  Node ID:          1
  Ring ID:          124
  Quorate:          Yes

  Votequorum information
  ----------------------
  Expected votes:   2
  Highest expected: 2
  Total votes:      2
  Quorum:           2
  Flags:            Quorate AutoTieBreaker

  Membership information
  ----------------------
      Nodeid      Votes Name
           1          1 cluster1 (local)
           2          1 10.71.217.71

and this on 10.71.217.71:

  Quorum information
  ------------------
  Date:             Fri Oct 20 13:22:46 2017
  Quorum provider:  corosync_votequorum
  Nodes:            1
  Node ID:          2
  Ring ID:          132
  Quorate:          No

  Votequorum information
  ----------------------
  Expected votes:   2
  Highest expected: 2
  Total votes:      1
  Quorum:           2 Activity blocked
  Flags:

  Membership information
  ----------------------
      Nodeid      Votes Name
           2          1 10.71.217.71 (local)

Instead, I would expect corosync on node 1 to exit, and node 2 to show "Expected votes: 1, Total votes: 1, Quorate: Yes".
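That expectation corresponds to a removal sequence along these lines (a sketch of what I assume the intended procedure is, assuming systemd manages corosync — I haven't found this documented anywhere):

  # On the node being removed (node 1): stop corosync first
  systemctl stop corosync

  # On the remaining node(s): delete node 1's node {} block from
  # the nodelist in corosync.conf, then reload
  corosync-cfgtool -R

i.e. the departing node should be out of the membership before, or as a direct result of, the reload.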

I notice that there is already some logic in votequorum.c that detects this condition, and it produces the following log messages on node 1:

  debug   [VOTEQ ] No nodelist defined or our node is not in the nodelist
  crit    [VOTEQ ] configuration error: nodelist or quorum.expected_votes must be configured!
  crit    [VOTEQ ] will continue with current runtime data

What is the rationale for continuing despite the obvious inconsistency? Surely this is destined to cause problems...?

I find that I get my expected behaviour with the following patch:

diff --git a/exec/votequorum.c b/exec/votequorum.c
index 1a97c6d..4ff7ff2 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -1286,7 +1287,8 @@ static char *votequorum_readconfig(int runtime)
                         error = (char *)"configuration error: nodelist or quorum.expected_votes must be configured!";
                 } else {
                         log_printf(LOGSYS_LEVEL_CRIT, "configuration error: nodelist or quorum.expected_votes must be configured!");
-                        log_printf(LOGSYS_LEVEL_CRIT, "will continue with current runtime data");
+                       log_printf(LOGSYS_LEVEL_CRIT, "exiting...");
+                       exit(1);
                }
                goto out;
        }

Is there any reason why that would not be a good idea?

Thanks,
Jonathan

_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
