[ 
https://issues.apache.org/jira/browse/GEODE-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295377#comment-17295377
 ] 

Bruce J Schuchardt commented on GEODE-9000:
-------------------------------------------

The server was reconnecting and emptying out messages queued during quorum 
checks:

{noformat}
logsAndStats/gemfire-cluster-server-0-02-01.log: [info 2021/03/04 10:30:28.595 
GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8c] Delivering 22 messages 
queued by quorum checker

logsAndStats/gemfire-cluster-server-0-02-01.log: [info 2021/03/04 10:30:28.596 
GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8c] received suspect 
message from 10.4.2.34(:locator)<ec><v0>:41000 for 
10.4.3.19(gemfire-cluster-locator-0:1:locator)<ec><v1>:41000: Member isn't 
responding to heartbeat requests

[fatal 2021/03/04 10:30:28.596 GMT gemfire-cluster-server-0 <ReconnectThread> 
tid=0x8c] Unexpected exception while booting membership services
java.lang.NullPointerException
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
{noformat}

The network-partition message was delivered during this time and was likely 
intended for the previous Membership service.  Adding a check for "isJoined" or 
a null currentView and ignoring the message is probably the right way to fix 
this problem.

> NPE During Reconnect After Network Split
> ----------------------------------------
>
>                 Key: GEODE-9000
>                 URL: https://issues.apache.org/jira/browse/GEODE-9000
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.14.0
>            Reporter: Juan Ramos
>            Priority: Major
>
> During a full network split when all members get shutdown by a partition, one 
> of the servers continually fails to reconnect due to a 
> {{NullPointerException}}. When using persistent regions, this also prevents 
> the remaining members from correctly start up as they might be waiting for 
> the stuck member to recover the latest data.
> The issue itself has been introduced by the fix for GEODE-8901, the new 
> implementation for {{GMSJoinLeave.processNetworkPartitionMessage}} doesn't 
> have a {{currentView}} installed during the reconnect phase ({{getView() == 
> null}}) and the following is shown in the logs:
> {noformat}
> [fatal 2021/03/04 03:32:02.744 GMT gemfire-cluster-server-0 <ReconnectThread> 
> tid=0x8a] Unexpected exception while booting membership services
> java.lang.NullPointerException
>       at 
> org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
>       at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
>       at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
>       at 
> org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
>       at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
>       at 
> org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> [error 2021/03/04 03:32:02.747 GMT gemfire-cluster-server-0 <ReconnectThread> 
> tid=0x8a] Unexpected problem starting up membership services
> java.lang.NullPointerException
>       at 
> org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
>       at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
>       at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
>       at 
> org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
>       at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
>       at 
> org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> [warn 2021/03/04 03:32:02.748 GMT gemfire-cluster-server-0 <ReconnectThread> 
> tid=0x8a] Caught SystemConnectException in reconnect
> org.apache.geode.SystemConnectException: Problem starting up membership 
> services: null.  Consult log file for more details
>       at 
> org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:189)
>       at 
> org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> [info 2021/03/04 03:32:02.749 GMT gemfire-cluster-server-0 <ReconnectThread> 
> tid=0x8a] Disconnecting old DistributedSystem to prepare for a reconnect 
> attempt
> {noformat}
> The above keeps happening during further reconnect attempts and the server 
> member can't re-join the distributed system.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to