Bruce Schuchardt created GEODE-6244:
---------------------------------------

             Summary: Healthy member kicked out by Sick member when final-check fails
                 Key: GEODE-6244
                 URL: https://issues.apache.org/jira/browse/GEODE-6244
             Project: Geode
          Issue Type: New Feature
          Components: membership
            Reporter: Bruce Schuchardt


I observed this in a user's logs and can't include artifacts.  Clients were 
herding to one server while another server was slow to return results.  
The clients caused that server to run out of file descriptors because its 
descriptor limit was set low.  When that happened, the server had trouble 
forming an outgoing tcp/ip connection to another server.  It tried 
MembershipManager.verifyMember(), which also failed to connect to the other 
server.  It then sent a RemoveMember message to the locators and several 
of the other servers, including the one it couldn't connect to.  That server 
immediately shut itself down.

MembershipManager.verifyMember() is documented to only initiate suspect 
processing on the target, not immediate removal.  The point of suspect 
processing is that some other process (i.e., the membership coordinator) does 
additional checking on the suspect in case the initiator is itself sick.  That 
was the case in this situation.

The observed behavior was

serverA unable to connect to serverB
serverA performs tcp/ip check in verifyMember
serverA's tcp/ip check fails (it's out of file descriptors, duh)
serverA sends RemoveMember message to locators and serverB
serverB shuts itself down (ForcedDisconnect)

The behavior should instead be

serverA unable to connect to serverB
serverA performs tcp/ip check in verifyMember
serverA's tcp/ip check fails (it's out of file descriptors, duh)
serverA sends SuspectMember message to locators & other servers
coordinator performs tcp/ip and heartbeat check on the suspect
coordinator determines suspect is available
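
The desired sequence above can be sketched as a toy model. This is only an illustration of the "second opinion" idea, not Geode code: SuspectFlowSketch, Outcome and handleFailedCheck are hypothetical names, and each member's connectivity is reduced to a simple predicate.

```java
import java.util.function.Predicate;

// Hypothetical model of the desired flow; none of these types are Geode classes.
final class SuspectFlowSketch {
    enum Outcome { MEMBER_KEPT, MEMBER_REMOVED }

    /**
     * A failed check by the initiator only raises suspicion. The suspect is
     * removed only if the coordinator's own independent check also fails.
     */
    static Outcome handleFailedCheck(String suspect,
                                     Predicate<String> initiatorCanReach,
                                     Predicate<String> coordinatorCanReach) {
        if (initiatorCanReach.test(suspect)) {
            return Outcome.MEMBER_KEPT; // no suspicion raised at all
        }
        // The initiator reports a suspect; the coordinator checks for itself.
        return coordinatorCanReach.test(suspect)
                ? Outcome.MEMBER_KEPT
                : Outcome.MEMBER_REMOVED;
    }

    public static void main(String[] args) {
        // serverA is out of file descriptors, so every connect attempt fails.
        Predicate<String> sickServerA = member -> false;
        // The coordinator is healthy and can reach serverB.
        Predicate<String> coordinator = member -> member.equals("serverB");
        System.out.println(handleFailedCheck("serverB", sickServerA, coordinator));
    }
}
```

With a sick initiator and a healthy coordinator, serverB stays in the view; it is removed only when both checks fail.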

This is all due to the checkMember call in GMSMembershipManager passing _true_ 
for the _initiateRemoval_ parameter.  It should be passing _false_.
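
A minimal sketch of what the flag changes, assuming a simplified signature. The class InitiateRemovalSketch, the two-argument checkMember, and the string "messages" are all hypothetical stand-ins for illustration; the real Geode method signatures and message classes may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the membership manager; not the Geode API.
final class InitiateRemovalSketch {
    // Records what would be sent to the locators and other servers.
    final List<String> sent = new ArrayList<>();
    private final Predicate<String> canConnect;

    InitiateRemovalSketch(Predicate<String> canConnect) {
        this.canConnect = canConnect;
    }

    /** Assumed shape of the check: the flag picks removal vs. suspicion. */
    boolean checkMember(String member, boolean initiateRemoval) {
        if (canConnect.test(member)) {
            return true; // member answered, nothing to report
        }
        sent.add((initiateRemoval ? "REMOVE " : "SUSPECT ") + member);
        return false;
    }

    /** The proposed fix: pass false so a failed check only raises suspicion. */
    boolean verifyMember(String member) {
        return checkMember(member, false);
    }

    public static void main(String[] args) {
        // A sick initiator: every outgoing connection attempt fails.
        InitiateRemovalSketch serverA = new InitiateRemovalSketch(member -> false);
        serverA.verifyMember("serverB");
        System.out.println(serverA.sent);
    }
}
```

In this model the only difference between the buggy and the fixed behavior is the boolean passed by verifyMember, which matches the one-argument change described above.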




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
