[ https://issues.apache.org/jira/browse/GEODE-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295377#comment-17295377 ]

Bruce J Schuchardt commented on GEODE-9000:
-------------------------------------------

The server was reconnecting and emptying out messages queued during quorum checks:

{noformat}
logsAndStats/gemfire-cluster-server-0-02-01.log: [info 2021/03/04 10:30:28.595 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8c] Delivering 22 messages queued by quorum checker

logsAndStats/gemfire-cluster-server-0-02-01.log: [info 2021/03/04 10:30:28.596 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8c] received suspect message from 10.4.2.34(:locator)<ec><v0>:41000 for 10.4.3.19(gemfire-cluster-locator-0:1:locator)<ec><v1>:41000: Member isn't responding to heartbeat requests

[fatal 2021/03/04 10:30:28.596 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8c] Unexpected exception while booting membership services
java.lang.NullPointerException
    at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
    at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
    at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
{noformat}

The network-partition message was delivered during this time and was likely intended for the previous Membership service. Adding a check for "isJoined" or a null currentView and ignoring the message is probably the right way to fix this problem.

> NPE During Reconnect After Network Split
> ----------------------------------------
>
>                 Key: GEODE-9000
>                 URL: https://issues.apache.org/jira/browse/GEODE-9000
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.14.0
>            Reporter: Juan Ramos
>            Priority: Major
>
> During a full network split when all members get shut down by a partition,
> one of the servers continually fails to reconnect due to a
> {{NullPointerException}}. When using persistent regions, this also prevents
> the remaining members from starting up correctly, as they might be waiting
> for the stuck member to recover the latest data.
> The issue was introduced by the fix for GEODE-8901: during the reconnect
> phase, the new implementation of
> {{GMSJoinLeave.processNetworkPartitionMessage}} doesn't have a
> {{currentView}} installed ({{getView() == null}}), and the following is
> shown in the logs:
> {noformat}
> [fatal 2021/03/04 03:32:02.744 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected exception while booting membership services
> java.lang.NullPointerException
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
>     at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
>     at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
>     at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> [error 2021/03/04 03:32:02.747 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected problem starting up membership services
> java.lang.NullPointerException
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
>     at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
>     at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
>     at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> [warn 2021/03/04 03:32:02.748 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Caught SystemConnectException in reconnect
> org.apache.geode.SystemConnectException: Problem starting up membership services: null. Consult log file for more details
>     at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:189)
>     at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
>     at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
>     at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
>     at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> [info 2021/03/04 03:32:02.749 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Disconnecting old DistributedSystem to prepare for a reconnect attempt
> {noformat}
> The above keeps happening during further reconnect attempts and the server
> member can't re-join the distributed system.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
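
For illustration only, the guard suggested in the comment (check "isJoined" or a null currentView and ignore the message) might look roughly like the sketch below. This is not Geode's code and not the actual patch: SketchView, SketchJoinLeave, installView, and the String sender are hypothetical stand-ins, and the contains() call merely stands in for whatever dereferences the view at GMSJoinLeave.java:1459.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a membership view; not Geode's real view class. */
final class SketchView {
  private final List<String> members;

  SketchView(List<String> members) {
    this.members = new ArrayList<>(members);
  }

  boolean contains(String member) {
    return members.contains(member);
  }
}

/** Hypothetical, heavily simplified stand-in for the join/leave service. */
final class SketchJoinLeave {
  // Both fields stay unset while the service is (re)starting, which is
  // the window in which the queued network-partition message was delivered.
  private volatile SketchView currentView;
  private volatile boolean isJoined;

  /** Assumed hook: called once the member has joined and a view is installed. */
  void installView(SketchView view) {
    this.currentView = view;
    this.isJoined = true;
  }

  /**
   * Returns true if the message was acted on, false if it was ignored.
   * A message that arrives before joining is treated as stale (likely
   * meant for the previous membership service) and dropped instead of
   * dereferencing a null view.
   */
  boolean processNetworkPartitionMessage(String sender) {
    SketchView view = currentView; // read once to avoid a race with installView
    if (!isJoined || view == null) {
      return false; // ignore: no view installed yet
    }
    // Stand-in for the real handling, which needs a non-null view.
    return view.contains(sender);
  }
}
```

With such a guard, a message delivered during the reconnect phase is dropped rather than aborting startup of the membership services; whether silently dropping it is acceptable in every case is a design question for the actual fix.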