[ 
https://issues.apache.org/jira/browse/GEODE-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733513#comment-16733513
 ] 

ASF subversion and git services commented on GEODE-6244:
--------------------------------------------------------

Commit 25134b19e2a324ff04c3a3d1139bafe641031729 in geode's branch 
refs/heads/feature/GEODE-6244 from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=25134b1 ]

GEODE-6244 Healthy member kicked out by Sick member

GMSMembershipManager.verifyMember() should not initiate direct removal
of the target member if an availability check fails.  Instead it should
initiate suspect processing.

This adds new unit tests for GMSHealthMonitor.checkIfAvailable() and
changes the availability check to initiate suspect processing if the
check fails.
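
As a rough illustration of the behavior described above (not the actual
Geode code), the sketch below models an availability check whose failure
results in a suspect request rather than a removal request. Every name in it
(AvailabilityCheckSketch, Action, Probe, the sent list) is hypothetical, and
the initiateRemoval flag is only assumed to work the way the issue
description says.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the availability check; these are not Geode's classes. */
class AvailabilityCheckSketch {

    /** What the checking member asks the rest of the cluster to do. */
    enum Action { NONE, SUSPECT, REMOVE }

    /** Stand-in for the tcp/ip probe; returns true if the target answered. */
    interface Probe {
        boolean canReach(String member);
    }

    /** Records the requests this member would have sent (illustration only). */
    final List<String> sent = new ArrayList<>();
    private final Probe probe;

    AvailabilityCheckSketch(Probe probe) {
        this.probe = probe;
    }

    /** A failed probe escalates to removal only when initiateRemoval is true. */
    Action checkIfAvailable(String member, boolean initiateRemoval) {
        if (probe.canReach(member)) {
            return Action.NONE;
        }
        Action action = initiateRemoval ? Action.REMOVE : Action.SUSPECT;
        sent.add(action + ":" + member);
        return action;
    }

    /** verifyMember as documented: it never escalates past suspicion. */
    boolean verifyMember(String member) {
        return checkIfAvailable(member, false) == Action.NONE;
    }

    public static void main(String[] args) {
        // serverA is out of file descriptors, so every probe it attempts fails.
        AvailabilityCheckSketch sickServerA = new AvailabilityCheckSketch(m -> false);
        boolean available = sickServerA.verifyMember("serverB");
        System.out.println(available + " " + sickServerA.sent);
    }
}
```

In this sketch a sick serverA can only raise suspicion about serverB; the
decision to remove is left to whoever handles the suspect request, which is
the role the issue assigns to the membership coordinator.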


> Healthy member kicked out by Sick member when final-check fails
> ---------------------------------------------------------------
>
>                 Key: GEODE-6244
>                 URL: https://issues.apache.org/jira/browse/GEODE-6244
>             Project: Geode
>          Issue Type: New Feature
>          Components: membership
>            Reporter: Bruce Schuchardt
>            Priority: Major
>
> I observed this in a user's logs & can't include artifacts:  Clients were 
> herding to one server when another server was being slow to return results.  
> The clients caused the server to run out of file descriptors because the 
> descriptor limit was set pretty low.  When that happened the server had 
> trouble forming an outgoing tcp/ip connection to another server.  It tried 
> using MembershipManager.verifyMember() which also failed to connect to the 
> other server.  When that happened it sent a RemoveMember message to the locators and 
> several of the other servers, including the one it couldn't connect to.  That 
> server immediately shut itself down.
>
> MembershipManager.verifyMember() is documented to only initiate suspect 
> processing on the target, not initiate immediate removal.  This is supposed 
> to be done so that some other process (i.e., the membership coordinator) will 
> do additional checking on the suspect in case the initiator is itself sick.  
> That was the case in this situation:
>
>  1. serverA unable to connect to serverB
>  2. serverA performs tcp/ip check in verifyMember
>  3. serverA's tcp/ip check fails (it's out of file descriptors, duh)
>  4. serverA sends RemoveMember message to locators and serverB
>  5. serverB shuts itself down (ForcedDisconnect)
>
> The behavior should instead be:
>
>  1. serverA unable to connect to serverB
>  2. serverA performs tcp/ip check in verifyMember
>  3. serverA's tcp/ip check fails (it's out of file descriptors, duh)
>  4. serverA sends SuspectMember message to locators & other servers
>  5. coordinator performs tcp/ip and heartbeat check on the suspect
>  6. coordinator determines suspect is available
>
> This is all due to the checkMember call in GMSMembershipManager passing 
> _true_ for the _initiateRemoval_ parameter.  It should be passing _false_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
