[ https://issues.apache.org/jira/browse/GEODE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce J Schuchardt resolved GEODE-8467.
---------------------------------------
    Fix Version/s: 1.13.0
       Resolution: Fixed

Bugnote: This resolves an issue with auto-reconnect that could leave a server in a hung state.

> server fails to notify of a ForcedDisconnect and fails to tear down the cache
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-8467
>                 URL: https://issues.apache.org/jira/browse/GEODE-8467
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
>
> A test having auto-reconnect enabled failed while restarting a server and
> hung. The restarting server was building its cache when it was kicked out of
> the cluster due to very high load on the test machine. Membership initiated
> a forced-disconnect
> {noformat}
> [fatal 2020/08/22 00:51:04.508 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] Membership service failure: Member isn't responding to heartbeat requests
> org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Member isn't responding to heartbeat requests
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2012)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1085)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:688)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1331)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1267)
> {noformat}
>
> and then logged that it was generating a description of the cache
> {noformat}
> [info 2020/08/22 00:51:05.933 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] generating XML to rebuild the cache after reconnect completes
> {noformat}
>
> but it never logged completion of this step and never forked a thread to tear
> down the cache. Any exception thrown by XML generation would have been
> caught by JGroups code, which logs the problem at a WARNING level. We have
> JGroups logging set to FATAL level so you wouldn't see the issue.
> We need to add exception handling around XML generation and, if detected,
> disable reconnect attempts and have the server shut down.
> The bug isn't easy to hit. I've run the test that failed over 5000 times
> without encountering it.
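
For readers who want to picture the remedy the description asks for (exception handling around XML generation, with reconnect disabled and the server shut down on failure), here is a minimal Java sketch. It is not the change committed for GEODE-8467 and it uses no Geode APIs: `ForcedDisconnectSketch`, `Outcome`, `onForcedDisconnect` and the two callbacks are hypothetical names invented for illustration. The point it tries to show is that a failure in XML generation gets caught and logged locally, instead of escaping to the caller where it might only be logged at a suppressed level, and that the tear-down thread is started on every path.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

public class ForcedDisconnectSketch {

    /** Hypothetical result holder for this sketch; not a Geode type. */
    static final class Outcome {
        final String cacheXml;          // null if XML generation failed
        final boolean reconnectEnabled; // false if XML generation failed
        final Thread tearDownThread;    // started on every path

        Outcome(String cacheXml, boolean reconnectEnabled, Thread tearDownThread) {
            this.cacheXml = cacheXml;
            this.reconnectEnabled = reconnectEnabled;
            this.tearDownThread = tearDownThread;
        }
    }

    /**
     * Handles a forced disconnect. Both arguments are stand-ins (assumptions):
     * xmlGenerator for "generate XML to rebuild the cache", tearDown for
     * "tear down the cache".
     */
    static Outcome onForcedDisconnect(Supplier<String> xmlGenerator, Runnable tearDown) {
        String xml = null;
        boolean reconnect = true;
        try {
            xml = xmlGenerator.get();
        } catch (Throwable t) {
            // Log here, visibly, rather than letting the exception propagate
            // to calling code that may log it at a filtered-out level.
            System.err.println("fatal: cache XML generation failed, disabling reconnect: " + t);
            reconnect = false;
        }
        // Fork the tear-down thread whether or not XML generation succeeded.
        Thread thread = new Thread(tearDown, "cache-teardown-sketch");
        thread.start();
        return new Outcome(xml, reconnect, thread);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean tornDown = new AtomicBoolean(false);
        Outcome outcome = onForcedDisconnect(
            () -> { throw new IllegalStateException("simulated XML generation failure"); },
            () -> tornDown.set(true));
        outcome.tearDownThread.join();
        System.out.println("reconnectEnabled=" + outcome.reconnectEnabled
            + " tornDown=" + tornDown.get());
    }
}
```

Catching `Throwable` is a simplification for the sketch; a real implementation would probably want to treat JVM-level errors separately.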