[ 
https://issues.apache.org/jira/browse/GEODE-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745657#comment-16745657
 ] 

ASF subversion and git services commented on GEODE-6244:
--------------------------------------------------------

Commit fce4d61dae28c4b244ee71fb4a10ce0bff11a6c9 in geode's branch 
refs/heads/feature/GEODE-6244b from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=fce4d61 ]

GEODE-6244 Healthy member kicked out by sick member

- do not allow membership manager suspect initiation to kick out a
member on the first failed check
- perform a self-health check before sending SuspectRequest messages
- consider members who have sent shutdown messages as gone when
performing "should I become coordinator" checks in GMSHealthMonitor
- modified the membership view installed by GMSJoinLeave to be immutable
so it isn't inadvertently changed
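
The last item above (an immutable installed view) could look roughly like the following. This is a minimal sketch, not Geode's code: the class name ImmutableViewSketch, its fields, and the without() helper are all hypothetical, and members are plain strings for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a membership view; not Geode's actual view class. */
final class ImmutableViewSketch {
  private final int viewId;
  private final List<String> members;

  ImmutableViewSketch(int viewId, List<String> members) {
    this.viewId = viewId;
    // Copy first so later changes to the caller's list cannot leak in,
    // then wrap so callers of getMembers() cannot mutate the installed view.
    this.members = Collections.unmodifiableList(new ArrayList<>(members));
  }

  int getViewId() {
    return viewId;
  }

  List<String> getMembers() {
    return members;
  }

  /** A membership change produces a new view instead of editing this one. */
  ImmutableViewSketch without(String departed) {
    List<String> remaining = new ArrayList<>(members);
    remaining.remove(departed);
    return new ImmutableViewSketch(viewId + 1, remaining);
  }
}
```

With this shape, an attempt to call add() or remove() on the list returned by getMembers() throws UnsupportedOperationException, so an accidental edit fails loudly instead of silently changing the installed view.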

Squashed commit of the following:

commit 44f37c38d3b42f1ec7b1c440cae234b3fc123955
Author: Bruce Schuchardt <bschucha...@pivotal.io>
Date:   Thu Jan 17 14:28:25 2019 -0800

    fixes for regression failures

commit 320e98f85f2e16ea5a48dc316aeb81094b7cfd8d
Author: Bruce Schuchardt <bschucha...@pivotal.io>
Date:   Fri Jan 4 14:42:03 2019 -0800

    fix for failing unit tests & an LGTM warning

commit 144f94335042fa8d879413edefe48aa02abb7cb3
Author: Bruce Schuchardt <bschucha...@pivotal.io>
Date:   Fri Jan 4 10:36:22 2019 -0800

    fixes for unit test hang

    - remove suspect from members-in-final-check collection and initiate
    both remote and local suspect processing
    - renamed "final check" to "availability check" since it isn't
    necessarily a "final" check
    - perform a self-check before telling others to check a suspect

commit fb3dfd00477cc48fb2d4dd85fe1ec532ed68f82b
Author: Bruce Schuchardt <bschucha...@pivotal.io>
Date:   Thu Jan 3 14:22:51 2019 -0800

    leave the member unsuspected after final check fails

    If the final check fails and we're not going to remove the suspect from
    the distributed system we need to leave it in an "unsuspected" state
    locally so that the background monitoring thread will look at it again.

    Also, if the final check failed in the membership coordinator there's no
    point in doing another check so we move directly to removing the
    suspect.
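
    The two cases described in this commit can be sketched as below. This is
    an illustration only: FailedCheckSketch, Outcome, and the method names are
    hypothetical and do not mirror GMSHealthMonitor's real fields or methods.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of what happens locally after a failed final check. */
final class FailedCheckSketch {
  enum Outcome { REMOVE_SUSPECT, LEAVE_UNSUSPECTED }

  // Members this node currently treats as suspects.
  private final Set<String> locallySuspected = new HashSet<>();

  void markSuspected(String member) {
    locallySuspected.add(member);
  }

  boolean isSuspected(String member) {
    return locallySuspected.contains(member);
  }

  Outcome onFinalCheckFailed(String member, boolean localIsCoordinator) {
    if (localIsCoordinator) {
      // The coordinator has no one else to defer to, so another check
      // adds nothing; it moves directly to removing the suspect.
      return Outcome.REMOVE_SUSPECT;
    }
    // A non-coordinator is not removing the member, so it clears the local
    // suspect state and lets background monitoring look at the member again.
    locallySuspected.remove(member);
    return Outcome.LEAVE_UNSUSPECTED;
  }
}
```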

commit 25134b19e2a324ff04c3a3d1139bafe641031729
Author: Bruce Schuchardt <bschucha...@pivotal.io>
Date:   Thu Jan 3 12:49:56 2019 -0800

    GEODE-6244 Healthy member kicked out by Sick member

    GMSMembershipManager.verifyMember() should not initiate direct removal
    of the target member if an availability check fails.  Instead it should
    initiate suspect processing.

    This adds new unit tests for GMSHealthMonitor.checkIfAvailable() and
    changes the availability check to initiate suspect processing if the
    check fails.

(cherry picked from commit f4b8cf2f8dbcb98b541b24238b50b4066ff136a8)


> Healthy member kicked out by Sick member when final-check fails
> ---------------------------------------------------------------
>
>                 Key: GEODE-6244
>                 URL: https://issues.apache.org/jira/browse/GEODE-6244
>             Project: Geode
>          Issue Type: New Feature
>          Components: membership
>    Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0
>            Reporter: Bruce Schuchardt
>            Priority: Major
>             Fix For: 1.9.0
>
>
> I observed this in a user's logs & can't include artifacts:  Clients were 
> herding to one server when another server was being slow to return results.  
> The clients caused the server to run out of file descriptors because the 
> descriptor limit was set pretty low.  When that happened the server had 
> trouble forming an outgoing tcp/ip connection to another server.  It tried 
> using MembershipManager.verifyMember() which also failed to connect to the 
> other server.  When that happened it sent a RemoveMessage to the locators and 
> several of the other servers, including the one it couldn't connect to.  That 
> server immediately shut itself down.
> MembershipManager.verifyMember() is documented to only initiate suspect 
> processing on the target, not initiate immediate removal.  This is supposed 
> to be done so that some other process (i.e., the membership coordinator) will 
> do additional checking on the suspect in case the initiator is itself sick.  
> That was the case in this situation.
>
> 1. serverA unable to connect to serverB
> 2. serverA performs tcp/ip check in verifyMember
> 3. serverA's tcp/ip check fails (it's out of file descriptors, duh)
> 4. serverA sends RemoveMember message to locators and serverB
> 5. serverB shuts itself down (ForcedDisconnect)
>
> The behavior should instead be:
>
> 1. serverA unable to connect to serverB
> 2. serverA performs tcp/ip check in verifyMember
> 3. serverA's tcp/ip check fails (it's out of file descriptors, duh)
> 4. serverA sends SuspectMember message to locators & other servers
> 5. coordinator performs tcp/ip and heartbeat check on the suspect
> 6. coordinator determines suspect is available
>
> This is all due to the checkMember call in GMSMembershipManager passing 
> _true_ for the _initiateRemoval_ parameter.  It should be passing _false_.
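
The difference the initiateRemoval flag makes in the scenario described above can be sketched as follows. This is a simplified model, not Geode's code: VerifyMemberSketch, Action, and both method signatures are hypothetical, and each process's tcp/ip check is reduced to a Predicate that says whether that process can reach a member.

```java
import java.util.function.Predicate;

/** Hypothetical model of the scenario; these are not Geode's real classes. */
final class VerifyMemberSketch {
  enum Action { NONE, SUSPECT, REMOVE }

  /**
   * What the initiator does after its own availability check.
   * canReach stands in for the initiator's tcp/ip check.
   */
  static Action checkMember(String member, Predicate<String> canReach,
      boolean initiateRemoval) {
    if (canReach.test(member)) {
      return Action.NONE;
    }
    // With initiateRemoval == true a sick initiator can evict a healthy member.
    return initiateRemoval ? Action.REMOVE : Action.SUSPECT;
  }

  /** The coordinator runs its own check before anyone is removed. */
  static boolean coordinatorRemoves(String suspect, Predicate<String> coordinatorCanReach) {
    return !coordinatorCanReach.test(suspect);
  }

  public static void main(String[] args) {
    Predicate<String> sickServerA = m -> false;       // out of file descriptors
    Predicate<String> healthyCoordinator = m -> true; // can reach serverB

    Action action = checkMember("serverB", sickServerA, false);
    System.out.println("serverA action: " + action);
    if (action == Action.SUSPECT) {
      System.out.println("coordinator removes serverB: "
          + coordinatorRemoves("serverB", healthyCoordinator));
    }
  }
}
```

Passing false keeps the removal decision with a second process, which is what protects serverB when the initiator itself is the unhealthy one.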



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
