[ https://issues.apache.org/jira/browse/GEODE-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745639#comment-16745639 ]
ASF subversion and git services commented on GEODE-6244:
--------------------------------------------------------
Commit efcfa9390c77814f7654a37e08fb76b42b782b8d in geode's branch
refs/heads/develop from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=efcfa93 ]
Revert "GEODE-6244 Healthy member kicked out by sick member"
There are unit test failures caused by what I thought were innocuous changes.
This reverts commit f4b8cf2f8dbcb98b541b24238b50b4066ff136a8.
> Healthy member kicked out by Sick member when final-check fails
> ---------------------------------------------------------------
>
> Key: GEODE-6244
> URL: https://issues.apache.org/jira/browse/GEODE-6244
> Project: Geode
> Issue Type: New Feature
> Components: membership
> Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1, 1.4.0, 1.5.0, 1.6.0,
> 1.7.0, 1.8.0
> Reporter: Bruce Schuchardt
> Priority: Major
> Fix For: 1.9.0
>
>
> I observed this in a user's logs & can't include artifacts: Clients were
> herding to one server when another server was being slow to return results.
> The clients caused the server to run out of file descriptors because the
> descriptor limit was set pretty low. When that happened the server had
> trouble forming an outgoing tcp/ip connection to another server. It tried
> using MembershipManager.verifyMember() which also failed to connect to the
> other server. When that happened it sent a RemoveMessage to the locators and
> several of the other servers, including the one it couldn't connect to. That
> server immediately shut itself down.
> MembershipManager.verifyMember() is documented to only initiate suspect
> processing on the target, not to initiate immediate removal. The intent is
> that some other process (i.e., the membership coordinator) does additional
> checking on the suspect in case the initiator is itself sick, which was the
> case in this situation:
>   1. serverA unable to connect to serverB
>   2. serverA performs tcp/ip check in verifyMember
>   3. serverA's tcp/ip check fails (it's out of file descriptors, duh)
>   4. serverA sends RemoveMember message to locators and serverB
>   5. serverB shuts itself down (ForcedDisconnect)
> The behavior should instead be:
>   1. serverA unable to connect to serverB
>   2. serverA performs tcp/ip check in verifyMember
>   3. serverA's tcp/ip check fails (it's out of file descriptors, duh)
>   4. serverA sends SuspectMember message to locators & other servers
>   5. coordinator performs tcp/ip and heartbeat check on the suspect
>   6. coordinator determines suspect is available
> This is all due to the checkMember call in GMSMembershipManager passing
> _true_ for the _initiateRemoval_ parameter. It should be passing _false_.
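
The difference between the two sequences in the description can be modeled with a small, self-contained sketch. This is not Geode's code: the class SuspectFlowSketch, the Action enum, and both method signatures are hypothetical and exist only to illustrate how an initiateRemoval flag changes who gets the final say. It assumes the initiator's and the coordinator's reachability checks can each be represented as a simple predicate.

```java
import java.util.function.Predicate;

// Hypothetical model of the flow described in the issue; not Geode's actual classes.
class SuspectFlowSketch {

    // What the initiator broadcasts after its own check of the target.
    enum Action { NONE, SUSPECT, REMOVE }

    // Models the initiator's decision. If its own check fails, the flag decides
    // whether it asks for immediate removal or only raises suspicion.
    static Action checkMember(String target,
                              Predicate<String> initiatorCanReach,
                              boolean initiateRemoval) {
        if (initiatorCanReach.test(target)) {
            return Action.NONE;
        }
        return initiateRemoval ? Action.REMOVE : Action.SUSPECT;
    }

    // Models the outcome for the target. A REMOVE is acted on directly, while a
    // SUSPECT makes the coordinator run its own independent check first.
    static boolean targetIsRemoved(Action action,
                                   String target,
                                   Predicate<String> coordinatorCanReach) {
        switch (action) {
            case REMOVE:
                return true;
            case SUSPECT:
                return !coordinatorCanReach.test(target);
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // serverA is sick (out of file descriptors), so every check it makes fails.
        Predicate<String> sickServerA = target -> false;
        // The coordinator is healthy and can reach serverB.
        Predicate<String> healthyCoordinator = target -> true;

        Action observed = checkMember("serverB", sickServerA, true);
        Action desired = checkMember("serverB", sickServerA, false);

        System.out.println("initiateRemoval=true:  " + observed
                + ", serverB removed: "
                + targetIsRemoved(observed, "serverB", healthyCoordinator));
        System.out.println("initiateRemoval=false: " + desired
                + ", serverB removed: "
                + targetIsRemoved(desired, "serverB", healthyCoordinator));
    }
}
```

In this model, passing true lets a sick initiator remove a healthy member on its own evidence, while passing false defers the decision to a second, independent check.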