Hi ClusterLabs,

I have a query about safely removing a node from a corosync cluster.

When "corosync-cfgtool -R" is issued, all nodes reload their configuration from corosync.conf. If I have removed a node from the nodelist but corosync is still running on that node, it receives the reload signal but carries on as if nothing had happened. This then causes problems on all nodes.
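(For concreteness, by "removed a node from the nodelist" I mean deleting its node {} block from corosync.conf — e.g. going from

  nodelist {
      node {
          ring0_addr: 10.71.217.70
          nodeid: 1
      }
      node {
          ring0_addr: 10.71.217.71
          nodeid: 2
      }
  }

to a nodelist containing only the nodeid 2 entry. The fragment is illustrative, not my exact config.)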

A specific example:

I have a running cluster containing two nodes: 10.71.217.70 (nodeid=1) and 10.71.217.71 (nodeid=2). When I remove node 1 from the nodelist in corosync.conf on both nodes then issue "corosync-cfgtool -R" on 10.71.217.71, I see this on 10.71.217.70:

  Quorum information
  ------------------
  Date:             Fri Oct 20 13:23:02 2017
  Quorum provider:  corosync_votequorum
  Nodes:            2
  Node ID:          1
  Ring ID:          124
  Quorate:          Yes

  Votequorum information
  ----------------------
  Expected votes:   2
  Highest expected: 2
  Total votes:      2
  Quorum:           2
  Flags:            Quorate AutoTieBreaker

  Membership information
  ----------------------
      Nodeid      Votes Name
           1          1 cluster1 (local)
           2          1 10.71.217.71

and this on 10.71.217.71:

  Quorum information
  ------------------
  Date:             Fri Oct 20 13:22:46 2017
  Quorum provider:  corosync_votequorum
  Nodes:            1
  Node ID:          2
  Ring ID:          132
  Quorate:          No

  Votequorum information
  ----------------------
  Expected votes:   2
  Highest expected: 2
  Total votes:      1
  Quorum:           2 Activity blocked
  Flags:

  Membership information
  ----------------------
      Nodeid      Votes Name
           2          1 10.71.217.71 (local)

Instead, I would expect corosync on node 1 to exit, and node 2 to show "Expected votes: 1, Total votes: 1, Quorate: Yes".
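That expectation corresponds to a removal sequence along these lines (a sketch of what I assume the intended procedure is, assuming systemd manages corosync — I haven't found this documented anywhere):

  # On the node being removed (node 1): stop corosync first
  systemctl stop corosync

  # On the remaining node(s): delete node 1's node {} block from
  # the nodelist in corosync.conf, then reload
  corosync-cfgtool -R

i.e. the departing node should be out of the membership before, or as a direct result of, the reload.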

I notice that there is already some logic in votequorum.c that detects this condition, and it produces the following log messages on node 1:

  debug   [VOTEQ ] No nodelist defined or our node is not in the nodelist
  crit    [VOTEQ ] configuration error: nodelist or quorum.expected_votes must be configured!
  crit    [VOTEQ ] will continue with current runtime data

What is the rationale for continuing despite the obvious inconsistency? Surely this is destined to cause problems...?

I find that I get my expected behaviour with the following patch:

diff --git a/exec/votequorum.c b/exec/votequorum.c
index 1a97c6d..4ff7ff2 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -1286,7 +1287,8 @@ static char *votequorum_readconfig(int runtime)
                         error = (char *)"configuration error: nodelist or quorum.expected_votes must be configured!";
                 } else {
                         log_printf(LOGSYS_LEVEL_CRIT, "configuration error: nodelist or quorum.expected_votes must be configured!");
-                        log_printf(LOGSYS_LEVEL_CRIT, "will continue with current runtime data");
+                       log_printf(LOGSYS_LEVEL_CRIT, "exiting...");
+                       exit(1);
                }
                goto out;
        }

Is there any reason why that would not be a good idea?

Thanks,
Jonathan

_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
