[ 
https://issues.apache.org/jira/browse/GEODE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197082#comment-17197082
 ] 

ASF subversion and git services commented on GEODE-8467:
--------------------------------------------------------

Commit c48c0c378f90bb2912e018856a1f6e3a46a610e8 in geode's branch 
refs/heads/develop from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=c48c0c3 ]

GEODE-8473: Hang in ReplyProcessor21 when forced-disconnect does not establish 
a cancellation cause (#5491)

Ensure that the cache is informed of a forced-disconnect in the
DisconnectThread.  This is a follow-on commit to GEODE-8467, which
ensured that the DisconnectThread is launched even when cache XML
generation fails.  This commit adds a try/catch in
GMSMembership.uncleanShutdown() so that the upstream
ClusterDistributionManager is informed of the failure and can set the
"rootCause" in its CancelCriterion.  ReplyProcessor21 and other objects
that poll for this "rootCause" are then released from waiting for
responses to messages sent to other members of the cluster.
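
The pattern described above can be sketched roughly as follows. This is
an illustration only, not the code from the commit: CancelSketch,
MembershipSketch, ReplyWaiterSketch and all of their methods are
hypothetical stand-ins, and the real GMSMembership.uncleanShutdown(),
ClusterDistributionManager and CancelCriterion APIs may look quite
different. The sketch assumes that a teardown step can throw, and shows
the root cause still being recorded so that a polling waiter is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a cancel criterion that records a root cause. */
final class CancelSketch {
  private final AtomicReference<Throwable> rootCause = new AtomicReference<>();

  /** Records the first cause reported; later calls are ignored. */
  void setRootCause(Throwable cause) {
    rootCause.compareAndSet(null, cause);
  }

  Throwable getRootCause() {
    return rootCause.get();
  }
}

/** Hypothetical stand-in for the membership-side shutdown path. */
final class MembershipSketch {
  static void uncleanShutdown(Runnable teardownStep, CancelSketch cancel, Throwable reason) {
    try {
      teardownStep.run();
    } catch (RuntimeException e) {
      // A failing teardown step must not swallow the disconnect notification.
      if (e != reason) {
        reason.addSuppressed(e);
      }
    }
    // Reached on both the normal path and the failure path.
    cancel.setRootCause(reason);
  }
}

/** Hypothetical stand-in for an object that waits for replies from peers. */
final class ReplyWaiterSketch {
  /** Returns true if all replies arrived, false on cancellation, timeout or interrupt. */
  static boolean waitForReplies(CountDownLatch replies, CancelSketch cancel, long maxMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxMillis);
    try {
      while (System.nanoTime() < deadline) {
        if (cancel.getRootCause() != null) {
          return false; // released because a root cause was set
        }
        if (replies.await(10, TimeUnit.MILLISECONDS)) {
          return true;
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return false;
  }
}

final class ForcedDisconnectDemo {
  public static void main(String[] args) {
    CancelSketch cancel = new CancelSketch();
    RuntimeException reason = new RuntimeException("forced-disconnect");
    // The teardown step fails, yet the root cause is still recorded.
    MembershipSketch.uncleanShutdown(
        () -> { throw new IllegalStateException("teardown failed"); }, cancel, reason);
    boolean gotReplies = ReplyWaiterSketch.waitForReplies(new CountDownLatch(1), cancel, 1000);
    System.out.println("waiter received replies: " + gotReplies);
  }
}
```

Without the catch block, the exception from the teardown step would
propagate before setRootCause() runs, and the waiter would keep polling
until its own timeout, which is the kind of hang the commit title
describes.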

> server fails to notify of a ForcedDisconnect and fails to tear down the cache
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-8467
>                 URL: https://issues.apache.org/jira/browse/GEODE-8467
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0, 1.14.0
>
>
> A test with auto-reconnect enabled failed: it hung while restarting a 
> server.  The restarting server was building its cache when it was kicked 
> out of the cluster due to very high load on the test machine.  Membership 
> initiated a forced-disconnect:
> {noformat}
> [fatal 2020/08/22 00:51:04.508 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] Membership service failure: Member isn't responding to heartbeat requests
> org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Member isn't responding to heartbeat requests
>         at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2012)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1085)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:688)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1331)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1267)
> {noformat}
>  
> and then logged that it was generating a description of the cache:
> {noformat}
> [info 2020/08/22 00:51:05.933 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] generating XML to rebuild the cache after reconnect completes
> {noformat}
>  
> but it never logged completion of this step and never forked a thread to tear 
> down the cache.  Any exception thrown by XML generation would have been 
> caught by JGroups code, which logs the problem at WARNING level.  We have 
> JGroups logging set to FATAL level, so the exception would not have 
> appeared in the logs.
> We need to add exception handling around XML generation and, if an 
> exception is caught, disable reconnect attempts and have the server shut 
> down.
> The bug isn't easy to hit.  I've rerun the failing test more than 5000 
> times without encountering it again.
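
One way to picture the handling requested in the description above is
the sketch below. It is a rough illustration under stated assumptions,
not Geode code: ReconnectPrepSketch and every method on it are
hypothetical names, and the XML generator is modelled as a plain
Supplier. The sketch catches a failure from XML generation, turns off
reconnect, and still starts the teardown thread in either case.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from Geode. */
final class ReconnectPrepSketch {
  private volatile boolean reconnectAllowed = true;
  private volatile String cacheXml;
  private final CountDownLatch teardownDone = new CountDownLatch(1);

  /** Called when membership reports a forced-disconnect. */
  void onForcedDisconnect(Supplier<String> xmlGenerator) {
    try {
      // Capture a description of the cache so it can be rebuilt after reconnect.
      cacheXml = xmlGenerator.get();
    } catch (RuntimeException e) {
      // Without a cache description a reconnect cannot rebuild the cache,
      // so disable reconnect and report the failure where it will be seen.
      reconnectAllowed = false;
      System.err.println("cache XML generation failed, reconnect disabled: " + e);
    }
    // The teardown thread is started whether or not XML generation succeeded.
    Thread teardown = new Thread(teardownDone::countDown, "teardown-sketch");
    teardown.setDaemon(true);
    teardown.start();
  }

  boolean isReconnectAllowed() {
    return reconnectAllowed;
  }

  String getCacheXml() {
    return cacheXml;
  }

  /** Waits for the teardown thread to finish; returns false on timeout or interrupt. */
  boolean awaitTeardown(long maxMillis) {
    try {
      return teardownDone.await(maxMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

The point of the design is that the exception never reaches the caller
(in the reported case, JGroups code logging at a level that was
filtered out), so the failure is both visible and acted upon.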



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
