[jira] [Reopened] (GEODE-3780) suspected member is never watched again after passing final check

Bruce Schuchardt (JIRA) Fri, 24 Aug 2018 16:04:09 -0700


     [ 
https://issues.apache.org/jira/browse/GEODE-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bruce Schuchardt reopened GEODE-3780:
-------------------------------------

Another variant of this bug was hit in network-down testing.

The problem occurs when the network partition isn't instantaneous, which 
is the case with most test frameworks that use iptables manipulation.

The network from the losing side to the surviving side was shut down, and 
process 13989 initiated a final existence check on one of the surviving-side 
processes:

{noformat}
system.log: [info 2018/08/21 21:56:43.082 PDT gemfire1_host1_13989 <Geode 
Failure Detection thread 1> tid=0x84] Performing final check for suspect member 
10.32.111.37(gemfire4_host2_12365:12365:locator)<ec><v1>:1025 reason=Unable to 
send messages to this member via JGroups
{noformat}

But the network from the surviving side to the losing side was still open, and 
13989 received messages from the suspect member:

{noformat}
system.log: [info 2018/08/21 21:56:48.084 PDT gemfire1_host1_13989 <Geode 
Failure Detection thread 1> tid=0x84] Final check failed but detected recent 
message traffic for suspect member 
10.32.111.37(gemfire4_host2_12365:12365:locator)<ec><v1>:1025

system.log: [info 2018/08/21 21:56:48.085 PDT gemfire1_host1_13989 <Geode 
Failure Detection thread 1> tid=0x84] Final check passed for suspect member 
10.32.111.37(gemfire4_host2_12365:12365:locator)<ec><v1>:1025
{noformat}
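
The decision visible in the two log excerpts above can be sketched roughly as 
follows. This is a minimal illustration, not Geode's actual code: the class 
FinalCheckSketch, the Outcome enum, the finalCheck method and the rule that 
"recent" means "received after the check started" are all assumptions made 
for the example.

```java
import java.time.Instant;

// Hypothetical model of a final check; names and the "recent traffic" rule
// are assumptions for illustration, not taken from the Geode source.
final class FinalCheckSketch {

    enum Outcome { PASSED, PASSED_ON_RECENT_TRAFFIC, FAILED }

    /**
     * @param probeAnswered       whether the suspect answered the direct probe
     * @param checkStarted        when the final check began
     * @param lastMessageReceived when any message last arrived from the
     *                            suspect, or null if none was ever seen
     */
    static Outcome finalCheck(boolean probeAnswered,
                              Instant checkStarted,
                              Instant lastMessageReceived) {
        if (probeAnswered) {
            return Outcome.PASSED;
        }
        // The probe failed, but inbound traffic that arrived while the check
        // was running is treated as proof that the suspect is still alive.
        boolean heardDuringCheck = lastMessageReceived != null
                && lastMessageReceived.isAfter(checkStarted);
        return heardDuringCheck ? Outcome.PASSED_ON_RECENT_TRAFFIC : Outcome.FAILED;
    }
}
```

With a one-way partition the probe goes unanswered while inbound messages keep 
arriving, so a rule of this kind reports a pass even though this process can 
no longer reach the suspect.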

The health monitor eventually suspected the other surviving-side member and 
requested its removal, but it never performed another final check on 
gemfire4_host2_12365, so process 13989 did not shut down:

{noformat}
system.log: [info 2018/08/21 21:56:48.086 PDT gemfire1_host1_13989 <Geode 
Failure Detection thread 2> tid=0x85] Final check failed - requesting removal 
of suspect member 10.32.111.37(gemfire3_host2_12393:12393)<v2>:1026

system.log: [info 2018/08/21 21:56:50.536 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.111.37(gemfire3_host2_12393:12393)<v2>:1026

system.log: [info 2018/08/21 21:56:50.536 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.111.37(gemfire4_host2_12365:12365:locator)<ec><v1>:1025

system.log: [info 2018/08/21 21:56:53.036 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.111.37(gemfire3_host2_12393:12393)<v2>:1026

system.log: [info 2018/08/21 21:56:53.036 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.111.37(gemfire4_host2_12365:12365:locator)<ec><v1>:1025

system.log: [info 2018/08/21 21:56:55.536 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.111.37(gemfire3_host2_12393:12393)<v2>:1026

system.log: [info 2018/08/21 21:56:58.036 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.108.137(gemfire1_host1_13989:13989)<v14>:1024

system.log: [info 2018/08/21 21:56:58.036 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.111.37(gemfire3_host2_12393:12393)<v2>:1026

system.log: [info 2018/08/21 21:57:00.536 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.108.137(gemfire1_host1_13989:13989)<v14>:1024

system.log: [info 2018/08/21 21:57:00.536 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.111.37(gemfire3_host2_12393:12393)<v2>:1026

system.log: [info 2018/08/21 21:57:03.036 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.108.137(gemfire1_host1_13989:13989)<v14>:1024

system.log: [info 2018/08/21 21:57:03.036 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] Failure detection is now watching 
10.32.111.37(gemfire3_host2_12393:12393)<v2>:1026

system.log: [info 2018/08/21 21:57:05.536 PDT gemfire1_host1_13989 <Geode 
Failure Detection Scheduler> tid=0x22] All other members are suspect at this 
point
{noformat}
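
One hypothetical way a monitor can end up in the state shown in the log above 
is sketched below. It is a toy model, not Geode's implementation: the class 
SuspectTrackingSketch, its methods, and the assumption that the scheduler 
skips members still marked as suspect are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Toy model (assumption, not Geode code): the scheduler only watches members
// that are not currently marked as suspect.
final class SuspectTrackingSketch {
    private final String self;
    private final List<String> view;
    private final Set<String> suspects = new LinkedHashSet<>();

    SuspectTrackingSketch(String self, List<String> view) {
        if (!view.contains(self)) {
            throw new IllegalArgumentException("view must contain self");
        }
        this.self = self;
        this.view = new ArrayList<>(view);
    }

    void suspect(String member) {
        suspects.add(member);
    }

    // Hypothetical hook: clearSuspicion=false models a passed final check
    // that leaves the member marked as suspect.
    void finalCheckPassed(String member, boolean clearSuspicion) {
        if (clearSuspicion) {
            suspects.remove(member);
        }
    }

    // First non-suspect member after self in view order, if there is one.
    Optional<String> nextToWatch() {
        int start = view.indexOf(self);
        for (int i = 1; i < view.size(); i++) {
            String candidate = view.get((start + i) % view.size());
            if (!suspects.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // every other member is suspect
    }
}
```

In this model, a member that passes its final check but stays in the suspect 
set is skipped on every later scheduling pass, so once the remaining members 
are suspected as well there is nothing left to watch. Clearing the suspicion 
when the check passes makes the member eligible to be watched again.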


> suspected member is never watched again after passing final check
> -----------------------------------------------------------------
>
>                 Key: GEODE-3780
>                 URL: https://issues.apache.org/jira/browse/GEODE-3780
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In a network-down test we saw a node on the losing side of the network 
> partition perform final checks on members on the winning side.  One of the 
> final checks mysteriously succeeded:
> [info 2017/09/17 12:24:45.552 PDT 
> gemfire1_rs-FullRegression-2017-09-15-21-00-35-client-10_8941 <Geode Failure 
> Detection thread 4> tid=0x128] Final check failed but detected recent message 
> traffic for suspect member 
> 10.32.109.252(gemfire3_rs-FullRegression-2017-09-15-21-00-35-client-16_6135:6135)<v2>:1026
> [info 2017/09/17 12:24:45.552 PDT 
> gemfire1_rs-FullRegression-2017-09-15-21-00-35-client-10_8941 <Geode Failure 
> Detection thread 4> tid=0x128] Final check passed for suspect member 
> 10.32.109.252(gemfire3_rs-FullRegression-2017-09-15-21-00-35-client-16_6135:6135)<v2>:1026
> After this the suspected member was never checked again and the losing side 
> failed to shut down.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
