[ https://issues.apache.org/jira/browse/GEODE-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745639#comment-16745639 ]
ASF subversion and git services commented on GEODE-6244:
--------------------------------------------------------

Commit efcfa9390c77814f7654a37e08fb76b42b782b8d in geode's branch refs/heads/develop from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=efcfa93 ]

Revert "GEODE-6244 Healthy member kicked out by sick member"

There are unit test failures caused by what I thought were innocuous changes

This reverts commit f4b8cf2f8dbcb98b541b24238b50b4066ff136a8.


> Healthy member kicked out by Sick member when final-check fails
> ---------------------------------------------------------------
>
>                 Key: GEODE-6244
>                 URL: https://issues.apache.org/jira/browse/GEODE-6244
>             Project: Geode
>          Issue Type: New Feature
>          Components: membership
>    Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1, 1.4.0, 1.5.0, 1.6.0,
> 1.7.0, 1.8.0
>            Reporter: Bruce Schuchardt
>            Priority: Major
>             Fix For: 1.9.0
>
>
> I observed this in a user's logs & can't include artifacts: Clients were
> herding to one server when another server was being slow to return results.
> The clients caused the server to run out of file descriptors because the
> descriptor limit was set pretty low. When that happened the server had
> trouble forming an outgoing tcp/ip connection to another server. It tried
> using MembershipManager.verifyMember() which also failed to connect to the
> other server. When that happened it sent a RemoveMessage to the locators and
> several of the other servers, including the one it couldn't connect to. That
> server immediately shut itself down.
> MembershipManager.verifyMember() is documented to only initiate suspect
> processing on the target, not initiate immediate removal. This is supposed
> to be done so that some other process (i.e., the membership coordinator) will
> do additional checking on the suspect in case the initiator is itself sick.
> That was the case in this situation.
> serverA unable to connect to serverB
> serverA performs tcp/ip check in verifyMember
> serverA's tcp/ip check fails (it's out of file descriptors, duh)
> serverA sends RemoveMember message to locators and serverB
> serverB shuts itself down (ForcedDisconnect)
>
> The behavior should instead be
>
> serverA unable to connect to serverB
> serverA performs tcp/ip check in verifyMember
> serverA's tcp/ip check fails (it's out of file descriptors, duh)
> serverA sends SuspectMember message to locators & other servers
> coordinator performs tcp/ip and heartbeat check on the suspect
> coordinator determines suspect is available
>
> This is all due to the checkMember call in GMSMembershipManager passing
> _true_ for the _initiateRemoval_ parameter. It should be passing _false_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
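
The two sequences in the issue description differ only in what the initiator does after its own tcp/ip check fails. The following is a minimal Java sketch of that difference under a deliberately simplified model. It is not Geode's code: the class SuspectVsRemoveSketch, the Outcome enum, the Predicate-based probes, and this checkMember signature are all hypothetical. Only the names checkMember, initiateRemoval, SuspectMember and RemoveMember are taken from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, simplified model of the final-check decision. Not Geode's actual classes.
final class SuspectVsRemoveSketch {

  enum Outcome { MEMBER_OK, SUSPECT_CLEARED, MEMBER_REMOVED }

  /**
   * localCheck stands in for the initiator's own tcp/ip probe and
   * coordinatorCheck for the coordinator's independent probe.
   * Messages that would be sent are appended to log as plain strings.
   */
  static Outcome checkMember(String member,
                             boolean initiateRemoval,
                             Predicate<String> localCheck,
                             Predicate<String> coordinatorCheck,
                             List<String> log) {
    if (localCheck.test(member)) {
      return Outcome.MEMBER_OK;
    }
    if (initiateRemoval) {
      // Observed behavior: the initiator trusts its own failed check.
      log.add("RemoveMember(" + member + ")");
      return Outcome.MEMBER_REMOVED;
    }
    // Desired behavior: raise suspicion and let the coordinator decide.
    log.add("SuspectMember(" + member + ")");
    if (coordinatorCheck.test(member)) {
      log.add("coordinator: " + member + " is available");
      return Outcome.SUSPECT_CLEARED;
    }
    log.add("RemoveMember(" + member + ")");
    return Outcome.MEMBER_REMOVED;
  }

  public static void main(String[] args) {
    // serverA is out of file descriptors, so its own probe always fails.
    Predicate<String> sickInitiator = m -> false;
    // The coordinator is healthy and can reach serverB.
    Predicate<String> healthyCoordinator = m -> true;

    List<String> log = new ArrayList<>();
    System.out.println(checkMember("serverB", true, sickInitiator, healthyCoordinator, log));
    System.out.println(checkMember("serverB", false, sickInitiator, healthyCoordinator, log));
    System.out.println(log);
  }
}
```

In this model, passing true for initiateRemoval lets a sick initiator remove a healthy member on its own, while passing false defers the decision to a second, independent check.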