[
https://issues.apache.org/jira/browse/GEODE-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Owen Nichols closed GEODE-9822.
-------------------------------
> Split-brain Certain During Network Partition in Two-Locator Cluster
> -------------------------------------------------------------------
>
> Key: GEODE-9822
> URL: https://issues.apache.org/jira/browse/GEODE-9822
> Project: Geode
> Issue Type: Bug
> Components: membership
> Reporter: Bill Burcham
> Assignee: Bill Burcham
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.15.0
>
>
> In a two-locator cluster with default member weights and the default setting
> (true) of enable-network-partition-detection, if a long-lived network
> partition separates the two members, a split-brain will arise: there will be
> two coordinators at the same time.
> The reason for this can be found in the GMSJoinLeave.isNetworkPartition()
> method. That method's name is misleading. A name like isMajorityLost() would
> probably be more apt. It needs to return true iff the weight of "crashed"
> members (in the prospective view) is greater-than-or-equal-to half (50%) of
> the total weight (of all members in the current view).
> What the method actually does is return true iff the weight of "crashed"
> members is greater-than 51% of the total weight. As a result, if we have two
> members of equal weight, and the coordinator sees that the non-coordinator is
> "crashed", the coordinator will keep running. If a network partition is
> happening, and the non-coordinator is still running, then it will become a
> coordinator and start producing views. Now we'll have two coordinators
> producing views concurrently.
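
To make the threshold difference concrete, here is a minimal Java sketch of the two predicates as the description words them. It is not the GMSJoinLeave code: the class name, method names, and the integer arithmetic are assumptions made for illustration, and the weight of 3 per member in the usage note is just an arbitrary equal weight.

```java
/** Illustrative only: compares the two threshold rules described in the ticket. */
public final class MajorityCheck {
    private MajorityCheck() {}

    /**
     * Rule the ticket says the method actually applies: true only when the
     * crashed weight is strictly greater than 51% of the total weight.
     */
    static boolean reportedCheck(int crashedWeight, int totalWeight) {
        return crashedWeight * 100L > totalWeight * 51L;
    }

    /**
     * Rule the ticket says is needed: true when the crashed weight is
     * greater than or equal to half of the total weight.
     */
    static boolean isMajorityLost(int crashedWeight, int totalWeight) {
        return crashedWeight * 2L >= totalWeight;
    }
}
```

With two members of equal weight (say 3 each, total 6), reportedCheck(3, 6) is false, so the coordinator keeps running, while isMajorityLost(3, 6) is true.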
> For this discussion "crashed" members are members for which the coordinator
> has received a RemoveMemberRequest message. These are members that the
> failure detector has deemed failed. Keep in mind the failure detector is
> imperfect (it's not always right), and that's kind of the whole point of this
> ticket: we've lost contact with the non-coordinator member, but that doesn't
> mean it can't still be running (on the other side of a partition).
> This bug is not limited to the two-locator scenario. Any set of members that
> can be partitioned into two sets of equal weight is susceptible. In fact it's
> even a little worse than that: any partition of the members in which two of
> the resulting sets each still hold 49% or more of the total weight will
> result in a split-brain.
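
The generalization above can be checked with simple arithmetic. The sketch below is a hypothetical helper, not Geode code: given the total weight on each side of a partition, it counts how many sides would keep running under each rule, assuming every side treats all weight outside itself as "crashed".

```java
import java.util.List;

/** Illustrative only: counts partition sides that keep running under each rule. */
public final class PartitionSurvivors {
    private PartitionSurvivors() {}

    private static long total(List<Integer> sideWeights) {
        return sideWeights.stream().mapToLong(Integer::longValue).sum();
    }

    /** A side keeps running unless the crashed weight is strictly greater than 51% of the total. */
    static long survivorsUnderReportedRule(List<Integer> sideWeights) {
        long total = total(sideWeights);
        return sideWeights.stream()
            .filter(side -> !((total - side) * 100 > total * 51))
            .count();
    }

    /** A side keeps running unless the crashed weight is at least half of the total. */
    static long survivorsUnderNeededRule(List<Integer> sideWeights) {
        long total = total(sideWeights);
        return sideWeights.stream()
            .filter(side -> !((total - side) * 2 >= total))
            .count();
    }
}
```

For sides weighing 49 and 51, the reported rule leaves both running (a split-brain), while the needed rule leaves only the heavier side. For an even split such as 3 and 3, the reported rule again leaves both running, and the needed rule leaves neither.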