Re: Member that is shutting down initiate removal of other members from the cluster

Ernie Burghardt Mon, 19 Oct 2020 11:27:00 -0700

Hi Jakov,

I'm looking into your question(s)... curious if you've run into this in a 
non-k8s cluster?
Might help focus the investigation...


Thanks,
EB

On 10/13/20, 7:51 AM, "Jakov Varenina" <[email protected]> wrote:

    Hi all,

    sorry for bothering, but we have noticed some differences in behavior 
    between 1.11 and 1.12 releases and need your help in understanding them.

    First, I would like to mention that we are running Geode in Kubernetes. 
    We shut down the worker node that is hosting one member (e.g. the 
    coordinator locator). The shutdown procedure affects the member in the 
    following way:

    1. TCP shared unordered connections towards other members are terminated

    2. Member receives graceful shut-down indication and starts with the 
    shut-down procedure

    Usually the connections start being terminated first and the shut-down 
    indication comes shortly after (e.g. ~10 milliseconds later). Step 1 
    triggers an availability check towards the other members whose TCP 
    connections have just been lost. At this point the coordinator is 
    unaware of the ongoing shut-down and assumes that all other members are 
    having issues because of the connection loss. Even after the 
    coordinator receives the graceful shut-down indication, this 
    availability check is not stopped. Later the availability check fails 
    for all members and the coordinator initiates their removal with 
    RemoveMemberMessage. This message is successfully received by the other 
    members, forcing them to shut down.

    In Geode 1.11 everything is the same, except that the availability check 
    passes and therefore removals aren't initiated.

    The logs show that in both releases the TCP availability check fails, 
    but the HeartbeatMessageRequest/HeartbeatMessage check fails only on 
    1.12 and passes on 1.11. In the 1.12 release, heartbeat request and 
    heartbeat messages are sent but do not reach their destination members. 
    The RemoveMemberMessage messages that are sent later on do reach their 
    destinations successfully. Does anybody know what changed in 1.12 that 
    could lead to such a difference in behavior?

    Additionally, the availability check is not stopped when graceful 
    shutdown is initiated. Do you think this could be improved, so that a 
    member stops an ongoing availability check when it detects a graceful 
    shutdown? Note also that the shutdown procedure is delayed by the 
    unsuccessful attempts to establish TCP connections towards the other 
    members.

    BRs,
    Jakov
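
The improvement suggested above (not acting on a failed availability check once the member knows it is shutting down itself) might look roughly like the sketch below. This is only an illustration of the idea and not Geode's actual code: the class ShutdownAwareChecker and all of its methods are hypothetical names, and the real membership and health-monitoring logic is considerably more involved.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; none of these names are Geode APIs. */
public class ShutdownAwareChecker {
    // Set once this member has received the graceful shut-down indication.
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    // Stand-in for sending a removal request for a member.
    private final List<String> removalsSent = new CopyOnWriteArrayList<>();

    /** Called when the graceful shut-down indication arrives. */
    public void onGracefulShutdown() {
        shuttingDown.set(true);
    }

    /**
     * Called when an availability check for a suspect member completes.
     * A failed check leads to removal only if this member is not itself
     * shutting down. Returns true if a removal was initiated.
     */
    public boolean onCheckResult(String suspect, boolean suspectResponded) {
        if (suspectResponded) {
            return false; // suspect is alive, nothing to do
        }
        if (shuttingDown.get()) {
            return false; // our own shut-down may explain the failed check
        }
        removalsSent.add(suspect);
        return true;
    }

    public List<String> removalsSent() {
        return removalsSent;
    }

    public static void main(String[] args) {
        ShutdownAwareChecker checker = new ShutdownAwareChecker();
        // Connection to "server-1" was lost and a check was started;
        // the shut-down indication arrives before the check completes.
        checker.onGracefulShutdown();
        boolean removed = checker.onCheckResult("server-1", false);
        System.out.println("removal initiated: " + removed);
    }
}
```

In the scenario described, the shut-down indication arrives only about 10 milliseconds after the connections drop, so a guard evaluated when the check completes (rather than when it starts) would cover checks that were already in flight.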
