[ https://issues.apache.org/jira/browse/GEODE-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Schuchardt updated GEODE-6244:
------------------------------------
    Affects Version/s: 1.1.0
                       1.1.1
                       1.2.0
                       1.3.0
                       1.2.1
                       1.4.0
                       1.5.0
                       1.6.0
                       1.7.0
                       1.8.0

> Healthy member kicked out by Sick member when final-check fails
> ---------------------------------------------------------------
>
>                 Key: GEODE-6244
>                 URL: https://issues.apache.org/jira/browse/GEODE-6244
>             Project: Geode
>          Issue Type: New Feature
>          Components: membership
>    Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1, 1.4.0, 1.5.0, 1.6.0, 
> 1.7.0, 1.8.0
>            Reporter: Bruce Schuchardt
>            Priority: Major
>             Fix For: 1.9.0
>
>
> I observed this in a user's logs and can't include artifacts: clients were 
> herding to one server while another server was slow to return results.  
> The clients caused that server to run out of file descriptors because its 
> descriptor limit was set fairly low.  Once that happened, the server had 
> trouble forming an outgoing tcp/ip connection to another server.  It tried 
> MembershipManager.verifyMember(), which also failed to connect to the 
> other server.  It then sent a RemoveMember message to the locators and 
> several of the other servers, including the one it couldn't connect to.  That 
> server immediately shut itself down.
>
> MembershipManager.verifyMember() is documented to only initiate suspect 
> processing on the target, not immediate removal.  The intent is that some 
> other process (i.e., the membership coordinator) does additional checking on 
> the suspect in case the initiator is itself sick, which is what happened 
> here.
>
> The observed sequence was:
> 1. serverA is unable to connect to serverB
> 2. serverA performs a tcp/ip check in verifyMember
> 3. serverA's tcp/ip check fails (it's out of file descriptors, duh)
> 4. serverA sends a RemoveMember message to the locators and serverB
> 5. serverB shuts itself down (ForcedDisconnect)
>
> The behavior should instead be:
> 1. serverA is unable to connect to serverB
> 2. serverA performs a tcp/ip check in verifyMember
> 3. serverA's tcp/ip check fails (it's out of file descriptors, duh)
> 4. serverA sends a SuspectMember message to the locators and other servers
> 5. the coordinator performs tcp/ip and heartbeat checks on the suspect
> 6. the coordinator determines the suspect is available
>
> This is all due to the checkMember call in GMSMembershipManager passing 
> _true_ for the _initiateRemoval_ parameter.  It should be passing _false_.
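
The difference between the two sequences in the issue can be sketched as a
small toy model. This is only an illustration under stated assumptions: the
class FinalCheckSketch, the Action enum, and both method signatures are
hypothetical and are not Geode's actual API. The only name taken from the
issue is the initiateRemoval flag. The initiator's tcp/ip check is modelled
as a predicate that always fails (as when it is out of file descriptors),
and the coordinator's check as one that succeeds.

```java
import java.util.function.Predicate;

public class FinalCheckSketch {
    /** What the initiating member does after its own final check. */
    public enum Action { NONE, SUSPECT, REMOVE }

    /**
     * Hypothetical stand-in for the final check; not Geode's real signature.
     * tcpCheck returns true when the caller can connect to the target.
     */
    public static Action checkMember(String target, Predicate<String> tcpCheck,
                                     boolean initiateRemoval) {
        if (tcpCheck.test(target)) {
            return Action.NONE; // target answered, nothing to do
        }
        // The check failed: either remove outright or only raise suspicion.
        return initiateRemoval ? Action.REMOVE : Action.SUSPECT;
    }

    /**
     * Models the outcome for the target: true if it ends up removed.
     * On SUSPECT the coordinator runs its own check before deciding.
     */
    public static boolean targetRemoved(Action action, String target,
                                        Predicate<String> coordinatorCheck) {
        switch (action) {
            case REMOVE:
                return true; // no second opinion is asked for
            case SUSPECT:
                return !coordinatorCheck.test(target);
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: serverA is out of file descriptors, so every connect fails.
        Predicate<String> sickInitiator = member -> false;
        // Assumption: the coordinator is healthy and can reach serverB.
        Predicate<String> coordinator = member -> true;

        for (boolean initiateRemoval : new boolean[] {true, false}) {
            Action action = checkMember("serverB", sickInitiator, initiateRemoval);
            boolean removed = targetRemoved(action, "serverB", coordinator);
            System.out.println("initiateRemoval=" + initiateRemoval
                + " -> " + action + ", serverB removed: " + removed);
        }
    }
}
```

In this model, passing true lets a sick initiator remove a healthy target on
its own authority, while passing false routes the decision through a second,
independent check.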



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
