Hi ClusterLabs,

I'm seeing a race condition in corosync where votequorum can have incorrect membership info when a node joins the cluster then leaves very soon after.

I'm on corosync-2.3.4 plus my patch https://github.com/corosync/corosync/pull/248. That patch makes the problem readily reproducible, but the bug was already present without it.

Here's the scenario. I have two hosts, cluster1 and cluster2. The corosync.conf on cluster2 is:

    totem {
      version: 2
      cluster_name: test
      config_version: 2
      transport: udpu
    }
    nodelist {
      node {
        nodeid: 1
        ring0_addr: cluster1
      }
      node {
        nodeid: 2
        ring0_addr: cluster2
      }
    }
    quorum {
      provider: corosync_votequorum
      auto_tie_breaker: 1
    }
    logging {
      to_syslog: yes
    }

The corosync.conf on cluster1 is the same except with "config_version: 1".

I start corosync on cluster2. When I start corosync on cluster1, it joins and then immediately leaves due to the lower config_version. (Previously corosync on cluster2 would also exit but with https://github.com/corosync/corosync/pull/248 it remains alive.)

But often at this point, cluster1's disappearance is not reflected in the votequorum info on cluster2:

    Quorum information
    ------------------
    Date:             Tue Oct 10 16:43:50 2017
    Quorum provider:  corosync_votequorum
    Nodes:            1
    Node ID:          2
    Ring ID:          700
    Quorate:          Yes

    Votequorum information
    ----------------------
    Expected votes:   2
    Highest expected: 2
    Total votes:      2
    Quorum:           2
    Flags:            Quorate AutoTieBreaker

    Membership information
    ----------------------
        Nodeid      Votes Name
             2          1 cluster2 (local)
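
The mismatch above is that "Total votes" is 2 while the membership table lists only one node with one vote. A minimal sketch of an automated check for that, assuming the status text has the layout shown above; the helper names parse_quorum_status and is_consistent are hypothetical and not part of corosync:

```python
import re

STALE = """\
Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate AutoTieBreaker

Membership information
----------------------
    Nodeid      Votes Name
         2          1 cluster2 (local)
"""

def parse_quorum_status(text):
    """Return (total_votes, {nodeid: votes}) from quorum status text."""
    total = None
    member_votes = {}
    in_membership = False
    for line in text.splitlines():
        stripped = line.strip()
        m = re.match(r"Total votes:\s+(\d+)$", stripped)
        if m:
            total = int(m.group(1))
        elif stripped.startswith("Nodeid"):
            # Header row of the membership table; node rows follow.
            in_membership = True
        elif in_membership:
            row = re.match(r"(\d+)\s+(\d+)\s+\S+", stripped)
            if row:
                member_votes[int(row.group(1))] = int(row.group(2))
    return total, member_votes

def is_consistent(text):
    """True when Total votes equals the sum of the listed members' votes."""
    total, member_votes = parse_quorum_status(text)
    return total is not None and total == sum(member_votes.values())

if __name__ == "__main__":
    print(is_consistent(STALE))
```

Running the same check against the status captured when things work as expected (Total votes: 1, one member with one vote) reports the state as consistent.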

The logs on cluster1 show:

    Oct 10 16:43:37 cluster1 corosync[15750]: [CMAP ] Received config version (2) is different than my config version (1)! Exiting

The logs on cluster2 show:

    Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership (10.71.218.17:588) was formed. Members joined: 1
    Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] This node is within the primary component and will provide service.
    Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
    Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership (10.71.218.18:592) was formed. Members left: 1
    Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
    Oct 10 16:43:37 cluster2 corosync[5102]: [MAIN ] Completed service synchronization, ready to provide service.

It looks like QUORUM has seen cluster1's arrival but not its departure!

When it works as expected, the state is left consistent:

    Quorum information
    ------------------
    Date:             Tue Oct 10 16:58:14 2017
    Quorum provider:  corosync_votequorum
    Nodes:            1
    Node ID:          2
    Ring ID:          604
    Quorate:          No

    Votequorum information
    ----------------------
    Expected votes:   2
    Highest expected: 2
    Total votes:      1
    Quorum:           2 Activity blocked
    Flags:            AutoTieBreaker

    Membership information
    ----------------------
        Nodeid      Votes Name
             2          1 cluster2 (local)
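
For reference, the "Quorum: 2 Activity blocked" line is what a plain majority rule gives for this cluster. A minimal sketch of that arithmetic, assuming a simple majority threshold and ignoring votequorum options such as auto_tie_breaker (which, as I understand it, would favour the partition holding the lowest nodeid, and that is not cluster2 here); the function names are hypothetical:

```python
def majority_quorum(expected_votes):
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1

def is_quorate(total_votes, expected_votes):
    """True when the votes present reach the majority threshold."""
    return total_votes >= majority_quorum(expected_votes)

if __name__ == "__main__":
    # Expected votes: 2, so the threshold is 2.
    print(majority_quorum(2))
    # cluster2 alone holds 1 vote: below the threshold, activity blocked.
    print(is_quorate(1, 2))
    # The stale state still counts cluster1's vote, so it looks quorate.
    print(is_quorate(2, 2))
```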

Logs on cluster1:

    Oct 10 16:58:01 cluster1 corosync[16430]: [CMAP ] Received config version (2) is different than my config version (1)! Exiting

Logs on cluster2 are either:

    Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new membership (10.71.218.17:600) was formed. Members joined: 1
    Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is within the primary component and will provide service.
    Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
    Oct 10 16:58:01 cluster2 corosync[17835]: [CMAP ] Highest config version (2) and my config version (2)
    Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new membership (10.71.218.18:604) was formed. Members left: 1
    Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
    Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
    Oct 10 16:58:01 cluster2 corosync[17835]: [MAIN ] Completed service synchronization, ready to provide service.

... in which it looks like QUORUM has seen cluster1's arrival *and* its departure,

or:

    Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new membership (10.71.218.17:632) was formed. Members joined: 1
    Oct 10 16:59:03 cluster2 corosync[18841]: [CMAP ] Highest config version (2) and my config version (2)
    Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new membership (10.71.218.18:636) was formed. Members left: 1
    Oct 10 16:59:03 cluster2 corosync[18841]: [QUORUM] Members[1]: 2
    Oct 10 16:59:03 cluster2 corosync[18841]: [MAIN ] Completed service synchronization, ready to provide service.

... in which it looks like QUORUM never noticed cluster1's brief presence.
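
The three outcomes can be told apart mechanically from the [QUORUM] messages in cluster2's log. A rough sketch, assuming the trace covers exactly one join/leave cycle and that the message wording matches the excerpts above; classify_trace is a hypothetical helper, not a corosync tool:

```python
def classify_trace(lines):
    """Classify one join/leave cycle by which QUORUM transitions were logged."""
    quorum_lines = [l for l in lines if "[QUORUM]" in l]
    gained = any("within the primary component" in l for l in quorum_lines)
    lost = any("within the non-primary component" in l for l in quorum_lines)
    if gained and not lost:
        return "arrival seen, departure missed"   # the buggy outcome
    if gained and lost:
        return "arrival and departure seen"
    if not gained and not lost:
        return "neither seen"
    return "unexpected"

# Message portions of the three cluster2 traces quoted in this mail.
BAD = [
    "[TOTEM ] A new membership (10.71.218.17:588) was formed. Members joined: 1",
    "[QUORUM] This node is within the primary component and will provide service.",
    "[QUORUM] Members[1]: 2",
    "[TOTEM ] A new membership (10.71.218.18:592) was formed. Members left: 1",
    "[QUORUM] Members[1]: 2",
]
GOOD_BOTH = [
    "[TOTEM ] A new membership (10.71.218.17:600) was formed. Members joined: 1",
    "[QUORUM] This node is within the primary component and will provide service.",
    "[QUORUM] Members[1]: 2",
    "[TOTEM ] A new membership (10.71.218.18:604) was formed. Members left: 1",
    "[QUORUM] This node is within the non-primary component and will NOT provide any services.",
    "[QUORUM] Members[1]: 2",
]
GOOD_NEITHER = [
    "[TOTEM ] A new membership (10.71.218.17:632) was formed. Members joined: 1",
    "[TOTEM ] A new membership (10.71.218.18:636) was formed. Members left: 1",
    "[QUORUM] Members[1]: 2",
]

if __name__ == "__main__":
    for name, trace in [("bad", BAD), ("both", GOOD_BOTH), ("neither", GOOD_NEITHER)]:
        print(name, "->", classify_trace(trace))
```

Only the first classification leaves votequorum in the stale state described earlier; the other two end with cluster2 correctly non-quorate.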

Any thoughts?

Thanks,
Jonathan

_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
