[jira] [Resolved] (GEODE-8467) server fails to notify of a ForcedDisconnect and fails to tear down the cache

Bruce J Schuchardt (Jira) Tue, 01 Sep 2020 11:20:58 -0700


     [ https://issues.apache.org/jira/browse/GEODE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Bruce J Schuchardt resolved GEODE-8467.
---------------------------------------
    Fix Version/s: 1.13.0
       Resolution: Fixed

Bugnote: This resolves an issue with auto-reconnect that could leave a server 
in a hung state.

> server fails to notify of a ForcedDisconnect and fails to tear down the cache
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-8467
>                 URL: https://issues.apache.org/jira/browse/GEODE-8467
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
>
> A test with auto-reconnect enabled hung while restarting a server.  The 
> restarting server was still building its cache when it was kicked out of 
> the cluster because of very high load on the test machine.  Membership 
> initiated a forced disconnect:
> {noformat}
> [fatal 2020/08/22 00:51:04.508 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] Membership service failure: Member isn't responding to heartbeat requests
> org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Member isn't responding to heartbeat requests
>         at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2012)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1085)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:688)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1331)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1267)
>  {noformat}
>  
> and then logged that it was generating a description of the cache
> {noformat}
> [info 2020/08/22 00:51:05.933 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] generating XML to rebuild the cache after reconnect completes
> {noformat}
>  
> but it never logged completion of this step and never forked a thread to 
> tear down the cache.  Any exception thrown by XML generation would have been 
> caught by JGroups code, which logs the problem at WARNING level.  Because we 
> have JGroups logging set to FATAL level, the exception would not have 
> appeared in the logs.
> We need to add exception handling around XML generation and, if an exception 
> is detected, disable reconnect attempts and have the server shut down.
> The bug isn't easy to hit.  I've rerun the failing test more than 5000 times 
> without reproducing it.
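
The handling proposed in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Geode change: the class ForcedDisconnectSketch, its fields, and the methods onForcedDisconnect and awaitTeardown are hypothetical names invented here and are not Geode or JGroups APIs. The sketch catches any failure from XML generation on the receiver thread, reports it directly, disables reconnect, and still forks the teardown thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names are Geode or JGroups APIs.
class ForcedDisconnectSketch {
  private volatile boolean reconnectEnabled = true;
  private volatile boolean cacheClosed = false;
  private volatile String savedCacheXml; // stays null if generation fails
  private final CountDownLatch teardownDone = new CountDownLatch(1);

  // Called on the receiver thread when membership forces a disconnect.
  void onForcedDisconnect(Callable<String> xmlGenerator) {
    try {
      savedCacheXml = xmlGenerator.call();
    } catch (Throwable t) {
      // Report the failure here rather than letting it escape to the caller,
      // which might log it at a level that is filtered out. Real code may
      // want to treat VirtualMachineError differently.
      System.err.println("XML generation failed; disabling reconnect: " + t);
      reconnectEnabled = false;
    }
    // Fork the teardown whether or not the XML description was saved.
    Thread teardown = new Thread(() -> {
      cacheClosed = true; // stand-in for closing the real cache
      teardownDone.countDown();
    }, "cache-teardown");
    teardown.setDaemon(true);
    teardown.start();
  }

  // Waits for the teardown thread; returns false on timeout or interrupt.
  boolean awaitTeardown(long timeoutMillis) {
    try {
      return teardownDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  boolean isReconnectEnabled() { return reconnectEnabled; }
  boolean isCacheClosed() { return cacheClosed; }
  String getSavedCacheXml() { return savedCacheXml; }

  public static void main(String[] args) {
    ForcedDisconnectSketch server = new ForcedDisconnectSketch();
    // Simulate XML generation blowing up during a forced disconnect.
    server.onForcedDisconnect(() -> {
      throw new IllegalStateException("simulated XML generation failure");
    });
    server.awaitTeardown(5000);
    System.out.println("reconnectEnabled=" + server.isReconnectEnabled()
        + ", cacheClosed=" + server.isCacheClosed());
  }
}
```

The point of the sketch is that the teardown is forked outside the try block, so a failure in XML generation can no longer prevent the cache from being closed, and the failure is reported without depending on the JGroups log level.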



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
