Juan Ramos created GEODE-9000:
---------------------------------

             Summary: NPE During Reconnect After Network Split
                 Key: GEODE-9000
                 URL: https://issues.apache.org/jira/browse/GEODE-9000
             Project: Geode
          Issue Type: Bug
          Components: membership
    Affects Versions: 1.14.0
            Reporter: Juan Ramos


During a full network split in which all members are shut down by the 
partition, one of the servers continually fails to reconnect due to a 
{{NullPointerException}}. When using persistent regions, this also prevents the 
remaining members from starting up correctly, as they might be waiting for the 
stuck member to recover the latest data.
The issue was introduced by the fix for GEODE-8901: the new implementation of 
{{GMSJoinLeave.processNetworkPartitionMessage}} runs without a 
{{currentView}} installed during the reconnect phase ({{getView() == null}}), 
and the following is shown in the logs:

{noformat}
[fatal 2021/03/04 03:32:02.744 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected exception while booting membership services
java.lang.NullPointerException
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
        at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
        at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
        at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
        at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
        at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
        at java.base/java.lang.Thread.run(Thread.java:834)

[error 2021/03/04 03:32:02.747 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected problem starting up membership services
java.lang.NullPointerException
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
        at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
        at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
        at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
        at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
        at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
        at java.base/java.lang.Thread.run(Thread.java:834)

[warn 2021/03/04 03:32:02.748 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Caught SystemConnectException in reconnect
org.apache.geode.SystemConnectException: Problem starting up membership services: null.  Consult log file for more details
        at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:189)
        at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
        at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
        at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
        at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
        at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
        at java.base/java.lang.Thread.run(Thread.java:834)

[info 2021/03/04 03:32:02.749 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Disconnecting old DistributedSystem to prepare for a reconnect attempt
{noformat}

The above keeps happening during further reconnect attempts and the server 
member can't re-join the distributed system.
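
To illustrate the failure mode, here is a minimal, self-contained sketch. It is 
not the Geode source: {{ReconnectGuardSketch}}, {{View}}, {{JoinLeave}} and all 
method names in it are hypothetical stand-ins, and the check performed against 
the view is an assumption. It only models a handler that dereferences the 
current view before one is installed, next to one possible null guard. Whether 
ignoring the message is the right behavior for the real fix is also an 
assumption.

```java
import java.util.List;

// Simplified, hypothetical model of the failure; not Geode's actual classes.
public class ReconnectGuardSketch {

  /** Stand-in for a membership view: just a list of member ids. */
  public static final class View {
    private final List<String> members;

    public View(List<String> members) {
      this.members = members;
    }

    public boolean contains(String member) {
      return members.contains(member);
    }
  }

  /** Stand-in for the join/leave service. */
  public static final class JoinLeave {
    private volatile View currentView; // stays null until a view is installed
    private boolean disconnectRequested;

    public void installView(View view) {
      currentView = view;
    }

    public View getView() {
      return currentView;
    }

    public boolean isDisconnectRequested() {
      return disconnectRequested;
    }

    /** Dereferences the view directly: throws NPE when no view is installed. */
    public void processUnguarded(String sender) {
      if (getView().contains(sender)) {
        disconnectRequested = true;
      }
    }

    /** Assumed guard: ignore the message while no view is installed. */
    public void processGuarded(String sender) {
      View view = getView(); // read once so the null check and the use agree
      if (view == null) {
        return;
      }
      if (view.contains(sender)) {
        disconnectRequested = true;
      }
    }
  }

  public static void main(String[] args) {
    JoinLeave joinLeave = new JoinLeave();
    try {
      joinLeave.processUnguarded("coordinator");
    } catch (NullPointerException e) {
      System.out.println("unguarded handler failed without a view");
    }
    joinLeave.processGuarded("coordinator");
    System.out.println("disconnect requested: " + joinLeave.isDisconnectRequested());
  }
}
```

Reading the view into a local variable before the null check avoids a second 
read returning a different value if another thread installs or clears the view 
in between.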



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
