Bruce J Schuchardt created GEODE-8690:
-----------------------------------------

             Summary: Member that fails availability check is never suspected again
                 Key: GEODE-8690
                 URL: https://issues.apache.org/jira/browse/GEODE-8690
             Project: Geode
          Issue Type: Bug
          Components: membership
    Affects Versions: 1.13.0, 1.12.0, 1.14.0
            Reporter: Bruce J Schuchardt


In a test run on support/1.12 there was a cluster with 3 locators and a number 
of servers.  It had a membership view like this:
{noformat}
[ loc1, loc2, loc3, server1, server2, etc]
{noformat}

The test killed loc1 and loc2 and then tried to restart loc2.  In this scenario 
loc3 should have detected the loss of the other two locators and become the 
membership coordinator, but it did not.  loc3 detected the loss of loc2 and 
then received a LEAVE request from loc1.  At that point it ought to have either 
started examining loc2 again or simply become the coordinator.  It did neither, 
and the cluster was left with no coordinator.

This is similar to GEODE-3780, but in that case an earlier availability check 
passed.

In the test run the names of the locators are:
loc1 = locatorgemfire_4_3
loc2 = locatorgemfire_4_4
loc3 = locatorgemfire_4_2

{noformat}
[info 2020/10/30 21:51:51.197 PDT <P2P message reader for 
(locatorgemfire_4_4_host2_3884:3884:locator)<ec><v1>:41005 shared unordered 
uid=2 port=42550> tid=0x36] Performing availability check for suspect member 
(locatorgemfire_4_4_host2_3884:3884:locator)<ec><v1>:41005 reason=member 
unexpectedly shut down shared, unordered connection

[info 2020/10/30 21:51:51.309 PDT <Pooled High Priority Message Processor 3> 
tid=0x51] received leave request from 
(locatorgemfire_4_3_host2_3866:3866:locator)<ec><v0>:41004 for 
(locatorgemfire_4_3_host2_3866:3866:locator)<ec><v0>:41004

[info 2020/10/30 21:51:51.345 PDT <Pooled High Priority Message Processor 3> 
tid=0x51] Checking to see if I should become coordinator.  My address is 
(locatorgemfire_4_2_host2_3852:3852:locator)<ec><v1>:41007

[info 2020/10/30 21:51:51.346 PDT <Pooled High Priority Message Processor 3> 
tid=0x51] View with removed and left members removed is 
View[rs-(locatorgemfire_4_3_host2_3866:3866:locator)<ec><v0>:41004|3] members: 
[(locatorgemfire_4_4_host2_3884:3884:locator)<ec><v1>:41005, 
(locatorgemfire_4_2_host2_3852:3852:locator)<ec><v1>:41007, 
(locatorgemfire_4_1_host2_3843:3843:locator)<ec><v1>:41006, 
(peergemfire_4_1_host2_3959:3959)<ec><v2>:41010{lead}, 
(peergemfire_4_2_host2_3967:3967)<ec><v2>:41009] and coordinator would be 
(locatorgemfire_4_4_host2_3884:3884:locator)<ec><v1>:41005
{noformat}
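
The last log line above suggests why loc3 did not take over: loc2 was still in 
the view after loc1's LEAVE was applied, so loc2 was named as the would-be 
coordinator.  The following is only a rough sketch of that reasoning, not 
Geode's actual membership code.  The class and method names are hypothetical, 
members are modeled as plain strings, and "first remaining member in the view 
becomes coordinator" is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified model of the "should I become coordinator" check.
// This is an illustration only and does not reproduce Geode's real classes.
final class CoordinatorCheckSketch {

    // Returns the view with left and removed members filtered out, order preserved.
    public static List<String> pruneView(List<String> view,
                                         Set<String> left,
                                         Set<String> removed) {
        List<String> remaining = new ArrayList<>();
        for (String member : view) {
            if (!left.contains(member) && !removed.contains(member)) {
                remaining.add(member);
            }
        }
        return remaining;
    }

    // Assumption: the first member of the pruned view is the coordinator candidate.
    public static Optional<String> candidate(List<String> view,
                                             Set<String> left,
                                             Set<String> removed) {
        return pruneView(view, left, removed).stream().findFirst();
    }

    public static void main(String[] args) {
        List<String> view = List.of("loc1", "loc2", "loc3", "server1", "server2");

        // loc1 sent a LEAVE request; loc2 is dead but was never recorded as
        // removed, so it is still chosen as the candidate.
        System.out.println(candidate(view, Set.of("loc1"), Set.of()));

        // If loc2 had been suspected again and removed, loc3 would be chosen.
        System.out.println(candidate(view, Set.of("loc1"), Set.of("loc2")));
    }
}
```

Under these assumptions, a member whose availability check is in flight or has 
failed, but which is never added to the removed set, keeps blocking the next 
live locator from electing itself.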




