[ 
https://issues.apache.org/jira/browse/GEODE-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamilla Aslami resolved GEODE-8901.
-----------------------------------
    Fix Version/s: 1.14.0
       Resolution: Fixed

> Surviving side server forcefully disconnected after network drop
> ----------------------------------------------------------------
>
>                 Key: GEODE-8901
>                 URL: https://issues.apache.org/jira/browse/GEODE-8901
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.14.0
>            Reporter: Kamilla Aslami
>            Assignee: Kamilla Aslami
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.14.0
>
>
> During a network partition, locator-0 and server-0 were partitioned from the 
> other members of the DS (locator-1, server-1, server-2 (leadMember), 
> server-3). We see the expected "Operation not permitted" Exceptions (in 
> locator-0) for the 4 surviving side members:
>  
> {code:java}
> [warn 2020/12/16 23:14:02.827 GMT <Geode Failure Detection thread 2> 
> tid=0x78] Unable to send message to 
> 10.108.1.130(gemfire-cluster-server-2:1)<v2>:41000
> java.io.IOException: Operation not permitted
> [warn 2020/12/16 23:14:02.938 GMT <Geode Heartbeat Sender> tid=0x22] Unable 
> to send message to 
> 10.108.3.134(gemfire-cluster-locator-1:1:locator)<ec><v0>:41000
> java.io.IOException: Operation not permitted
> [warn 2020/12/16 23:14:06.701 GMT <Geode Membership View Creator> tid=0x79] 
> Unable to send message to 10.108.3.135(gemfire-cluster-server-1:1)<v4>:41000
> java.io.IOException: Operation not permitted
> [warn 2020/12/16 23:14:10.322 GMT <Geode Failure Detection thread 3> 
> tid=0x7a] Unable to send message to 
> 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000
> java.io.IOException: Operation not permitted
> {code}
> As expected, we see the loss of quorum:
> {noformat}
> [warn 2020/12/16 23:14:11.718 GMT <Geode Membership View Creator> tid=0x79] 
> total weight lost in this view change is 28 of 51.  Quorum has been 
> lost!{noformat}
> However, we expected to see a lost weight of 38 (10 + 15 + 10 + 3) for 
> server-1, server-2, server-3 and locator-1, respectively. The reported 28 
> is 10 short, which matches the weight of server-3. What we do see is that 
> server-3 gets forcefully disconnected as well. That might occur because, 
> after the "Operation not permitted" Exception above, server-3 passes an 
> availability check.
> {noformat}
> [info 2020/12/16 23:14:10.323 GMT <Geode Failure Detection thread 3> 
> tid=0x7a] Performing availability check for suspect member 
> 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000 reason=Unable to send 
> messages to this member via JGroups
> ...
> [warn 2020/12/16 23:14:11.711 GMT <Geode Membership View Creator> tid=0x79] 
> these members failed to respond to the view change: 
> [10.108.3.134(gemfire-cluster-locator-1:1:locator)<ec><v0>:41000, 
> 10.108.3.135(gemfire-cluster-server-1:1)<v4>:41000, 
> 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000, 
> 10.108.1.130(gemfire-cluster-server-2:1)<v2>:41000]
> [info 2020/12/16 23:14:11.714 GMT <Geode View Creator verification thread 1> 
> tid=0x7c] checking state of member 
> 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000
> [info 2020/12/16 23:14:11.714 GMT <Geode View Creator verification thread 1> 
> tid=0x7c] member 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000 passed 
> availability check{noformat}
> This issue looks similar to GEODE-8721, which has been fixed in 
> b7afc604b9c2fafe4388dcdcf05fc7ec49c0ce86, but the failure logs don't contain 
> the log message associated with GEODE-8721:
> {noformat}
> Availability check detected recent message traffic for suspect 
> member{noformat}
> That log message includes a timestamp showing the time of last contact. In 
> GEODE-8721 we see the timestamp being continually updated.
>  
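
The weight arithmetic in the description can be checked with a small sketch. The weights for server-1, server-2, server-3 and locator-1 come from the report; the weights for locator-0 (3) and server-0 (10) are assumptions, chosen because they bring the total to the 51 shown in the log. The class and method names are hypothetical and are not part of the Geode API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class QuorumWeightSketch {

    // Member weights. The four surviving-side values are taken from the report;
    // locator-0 and server-0 are ASSUMED values that make the total 51.
    static Map<String, Integer> weights() {
        Map<String, Integer> w = new LinkedHashMap<>();
        w.put("locator-0", 3);   // assumption
        w.put("server-0", 10);   // assumption
        w.put("locator-1", 3);
        w.put("server-1", 10);
        w.put("server-2", 15);   // lead member
        w.put("server-3", 10);
        return w;
    }

    // Sum of the weights of every member in the view.
    static int totalWeight(Map<String, Integer> weights) {
        int sum = 0;
        for (int value : weights.values()) {
            sum += value;
        }
        return sum;
    }

    // Sum of the weights of the members counted as lost.
    static int lostWeight(Map<String, Integer> weights, List<String> lostMembers) {
        int sum = 0;
        for (String member : lostMembers) {
            sum += weights.get(member);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> w = weights();
        int total = totalWeight(w);

        // What locator-0 was expected to report: all four unreachable members lost.
        int expected = lostWeight(w,
                List.of("server-1", "server-2", "server-3", "locator-1"));

        // What the log shows: server-3 passed the availability check,
        // so its weight is not counted as lost.
        int reported = lostWeight(w,
                List.of("server-1", "server-2", "locator-1"));

        System.out.println("expected lost weight: " + expected + " of " + total); // 38 of 51
        System.out.println("reported lost weight: " + reported + " of " + total); // 28 of 51
    }
}
```

The difference between the two sums is exactly the weight of server-3, which is consistent with the "passed availability check" line in the log.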



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
