[ https://issues.apache.org/jira/browse/GEODE-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494855#comment-17494855 ]
Barrett Oglesby edited comment on GEODE-9910 at 2/19/22, 12:42 AM: ------------------------------------------------------------------- Here is some analysis of this issue. h3. Server Addresses node 1: membership: 10.196.55.141(15661)<ec><v0>:42000 locator: 10.196.55.141:10335 node 2: membership: 10.196.55.142(19002)<ec><v1>:42000 locator: 10.196.55.142:10335 h3. Node2 Initial Disconnect node2 lost connectivity with node1 and removed it: {noformat} 2021-11-28 04:03:45,084 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] Availability check failed for member 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:03:45,084 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] Requesting removal of suspect member 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:03:45,085 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] This member is becoming the membership coordinator with address 10.196.55.142(19002)<ec><v1>:42000 {noformat} It then realized that quorum had been lost (node1 was coordinator with weight=15; node2 was not coordinator with weight=10): {noformat} 2021-11-28 04:03:45,091 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] View Creator thread is starting 2021-11-28 04:03:45,091 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] 10.196.55.141(15661)<ec><v0>:42000 had a weight of 15 2021-11-28 04:03:45,092 WARN [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] total weight lost in this view change is 15 of 25. Quorum has been lost! 2021-11-28 04:03:45,092 FATAL [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] Possible loss of quorum due to the loss of 1 cache processes: [10.196.55.141(15661)<ec><v0>:42000] {noformat} And disconnected itself from the distributed system: {noformat} 2021-11-28 04:03:46,093 FATAL [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] Membership service failure: Exiting due to possible network partition event due to loss of 1 cache processes: [10.196.55.141(15661)<ec><v0>:42000] org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Exiting due to possible network partition event due to loss of 1 cache processes: [10.196.55.141(15661)<ec><v0>:42000] at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:1787) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1122) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.access$1300(GMSJoinLeave.java:80) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.prepareAndSendView(GMSJoinLeave.java:2588) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.sendInitialView(GMSJoinLeave.java:2204) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.run(GMSJoinLeave.java:2286) [geode-membership-1.14.0.jar:?] {noformat} It stopped its locator: {noformat} 2021-11-28 04:03:46,794 INFO [org.apache.geode.distributed.internal.InternalLocator]-[ReconnectThread] [] Distribution Locator on vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142 is stopped{noformat} h3. Node2 Reconnect Attempt 1 {noformat} 2021-11-28 04:04:46,800 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Attempting to reconnect to the distributed system. This is attempt #1. {noformat} The first reconnect attempt failed to get quorum (it needed a weight of 13 but is 10): {noformat} 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] performing a quorum check to see if location services can be started early 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] beginning quorum check with GMSQuorumChecker on view View[10.196.55.141(15661)<ec><v0>:42000|1] members: [10.196.55.141(15661)<ec><v0>:42000{lead}, 10.196.55.142(19002)<ec><v1>:42000] 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: sending request to 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: sending request to 10.196.55.142(19002)<ec><v1>:42000 2021-11-28 04:04:46,812 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] received ping-pong response from /10.196.55.142<v1>:42000 2021-11-28 04:04:46,812 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: mapped address to member ID 10.196.55.142(19002)<ec><v1>:42000 2021-11-28 04:04:46,812 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: waiting up to 15000ms to receive a quorum of responses ... 2021-11-28 04:05:01,821 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: timeout waiting for responses. 1 responses received 2021-11-28 04:05:01,822 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: contacted 1 processes with 10 member weight units. Threshold for a quorum is 13 {noformat} Since quorum failed, the locator did not restart: {noformat} 2021-11-28 04:05:01,822 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] quorum check failed - not allowing location services to start early {noformat} It then was able to send a *FindCoordinatorRequest* and receive a *FindCoordinatorResponse* from 10.196.55.141:10335 (node1's locator). It then tries to join the distributed system which fails. It does this 4 times, but isn't able to join the distributed system: {noformat} 2021-11-28 04:05:02,078 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: java.net.ConnectException: Connection refused (Connection refused) 2021-11-28 04:05:06,946 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=1, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|1] members: [10.196.55.141(15661)<ec><v0>:42000{lead}]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:05:06,948 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:05:10,015 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:05:10,015 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[]; registrants=[]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=1; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|1] members: [10.196.55.141(15661)<ec><v0>:42000{lead}]; responses=[] 2021-11-28 04:05:10,016 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 *2021-11-28 04:05:10,016 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Attempting to join the distributed system through coordinator 10.196.55.141(15661)<ec><v0>:42000 using address 10.196.55.142(19002)<ec>:42000 2021-11-28 04:05:22,017 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: java.net.ConnectException: Connection refused (Connection refused) 2021-11-28 04:05:23,050 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:05:23,052 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:05:26,111 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:05:26,111 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[10.196.55.141(15661)<ec><v0>:42000]; registrants=[]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]; responses=[] 2021-11-28 04:05:26,113 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:26,113 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response 2021-11-28 04:05:38,113 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: java.net.ConnectException: Connection refused (Connection refused) 2021-11-28 04:05:39,145 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:05:39,147 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:05:42,207 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:05:42,207 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[10.196.55.141(15661)<ec><v0>:42000]; registrants=[10.196.55.142(19002)<ec>:42000]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]; responses=[] 2021-11-28 04:05:42,209 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:42,209 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response 2021-11-28 04:05:54,210 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: java.net.ConnectException: Connection refused (Connection refused) 2021-11-28 04:05:55,242 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:05:55,244 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:05:58,303 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:05:58,303 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[10.196.55.141(15661)<ec><v0>:42000]; registrants=[10.196.55.142(19002)<ec>:42000]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]; responses=[] 2021-11-28 04:05:58,305 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:58,305 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response {noformat} * This message is coming from {*}GMSJoinLeave.attemptToJoin{*}. This fails since the ReconnectThread continues on. node1-vm.log contains these messages from *GMSLocator.processFindCoordinatorRequest* corresponding to the *FindCoordinatorRequests* received from node2: {noformat} 2021-11-28 04:05:06,898 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator request thread 1] [] Peer locator: coordinator from view is 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:23,045 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator request thread 1] [] Peer locator: coordinator from view is 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:39,141 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator request thread 1] [] Peer locator: coordinator from registrations is 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:55,238 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator request thread 1] [] Peer locator: coordinator from registrations is 10.196.55.141(15661)<ec><v0>:42000 {noformat} Since, node2 was unable to join the distributed system, it gave up: {noformat} 2021-11-28 04:06:10,309 WARN [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Caught SystemConnectException in reconnect org.apache.geode.SystemConnectException: Problem starting up membership services at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1811) ~[geode-membership-1.14.0.jar:?] Caused by: org.apache.geode.distributed.internal.membership.api.MemberStartupException: Unable to join the distributed system in 68228ms at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.join(GMSJoinLeave.java:411) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership.join(GMSMembership.java:533) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership.access$1200(GMSMembership.java:72) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.joinDistributedSystem(GMSMembership.java:1752) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:242) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171) ~[geode-core-1.14.0.jar:?] {noformat} Node2 Reconnect Attempt 2 {noformat} 2021-11-28 04:07:10,310 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Attempting to reconnect to the distributed system. This is attempt #2.{noformat} The second reconnect attempt succeeded in getting quorum (this means node2 was able to send a ping message to node1 and receive a pong message back: {noformat} 2021-11-28 04:07:10,319 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] performing a quorum check to see if location services can be started early 2021-11-28 04:07:10,319 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] beginning quorum check with GMSQuorumChecker on view View[10.196.55.141(15661)<ec><v0>:42000|1] members: [10.196.55.141(15661)<ec><v0>:42000{lead}, 10.196.55.142(19002)<ec><v1>:42000] 2021-11-28 04:07:10,319 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: sending request to 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:07:10,320 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: waiting up to 15000ms to receive a quorum of responses 2021-11-28 04:07:10,320 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[unicast receiver,vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2-26663] [] received ping-pong response from /10.196.55.141<v0>:42000 2021-11-28 04:07:10,321 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[unicast receiver,vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2-26663] [] quorum check: mapped address to member ID 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:07:10,820 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: received responses from all members that were in the old distributed system {noformat} It then started the locator: {noformat} 2021-11-28 04:07:10,820 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Quorum check passed - allowing location services to start early 2021-11-28 04:07:10,879 INFO [org.apache.geode.distributed.internal.InternalLocator]-[ReconnectThread] [] Starting peer location for Distribution Locator on vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142 2021-11-28 04:07:10,879 INFO [org.apache.geode.distributed.internal.tcpserver.TcpServer]-[ReconnectThread] [] Locator was created at Sun Nov 28 04:07:10 UTC 2021 2021-11-28 04:07:10,879 INFO [org.apache.geode.distributed.internal.tcpserver.TcpServer]-[ReconnectThread] [] Listening on port 10335 bound on address vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142 {noformat} And recovered state from node1's locator (this means node2 was able to send a *GetViewRequest* to node1 and receive a *GetViewResponse* back): {noformat} 2021-11-28 04:07:10,880 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread] [] GemFire peer location service starting. Other locators: 10.196.55.142[10335],10.196.55.141[10335],10.196.55.133[10335] Locators preferred as coordinators: true Network partition detection enabled: true View persistence file: /locator10335view.dat 2021-11-28 04:07:10,898 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread] [] Peer locator attempting to recover from HostAndPort[/10.196.55.141:10335] 2021-11-28 04:07:10,965 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread] [] Peer locator recovered initial membership of View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000] 2021-11-28 04:07:10,966 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread] [] Peer locator recovered state from HostAndPort[/10.196.55.141:10335] {noformat} It then had the same behavior with joining the distributed system as reconnect attempt 1: {noformat} 2021-11-28 04:07:11,153 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=-100, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.142(19002)<ec>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|-100] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.142:10335] 2021-11-28 04:07:11,224 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000, 10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:07:11,226 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:07:14,303 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:07:14,304 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=2; findInViewResponses=0; alreadyTried=[]; registrants=[]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]; responses=[] 2021-11-28 04:07:14,305 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:07:14,305 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Attempting to join the distributed system through coordinator 10.196.55.141(15661)<ec><v0>:42000 using address 10.196.55.142(19002)<ec>:42000 2021-11-28 04:07:29,409 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response 2021-11-28 04:07:44,516 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response 2021-11-28 04:07:59,617 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response {noformat} Since, it was unable to join the distributed system, it gave up: {noformat} 2021-11-28 04:08:11,622 WARN [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Caught SystemConnectException in reconnect {noformat} *However, the locator was not stopped.* h3. Node2 Reconnect Attempt 3 {noformat} 2021-11-28 04:09:11,623 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Attempting to reconnect to the distributed system. This is attempt #3. {noformat} The third reconnect attempt also succeeded in getting quorum: {noformat} 2021-11-28 04:09:11,631 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] performing a quorum check to see if location services can be started early 2021-11-28 04:09:11,631 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Quorum check passed - allowing location services to start early {noformat} Then, the exception occurred trying to start the locator since it was still running from reconnect attempt 2: {noformat} 2021-11-28 04:09:11,632 WARN [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Exception occurred while trying to connect the system during reconnect java.lang.IllegalStateException: A locator can not be created because one already exists in this JVM. at org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424) ~[geode-core-1.14.0.jar:?] {noformat} I'm not sure why the servers were able to exchange some messages (e.g. {*}FindCoordinatorRequest/Response{*}, {*}GetViewRequest/Response{*}) but not others (e.g. {*}JoinRequestMessage/JoinResponseMessage{*}), but that could be just the state of the servers at the time. In any event, if a reconnect attempt fails, any started locator should be stopped. was (Author: barry.oglesby): Here is some analysis of this issue. h3. Server Addresses node 1: membership: 10.196.55.141(15661)<ec><v0>:42000 locator: 10.196.55.141:10335 node 2: membership: 10.196.55.142(19002)<ec><v1>:42000 locator: 10.196.55.142:10335 h3. Node2 Initial Disconnect node2 lost connectivity with node1 and removed it: {noformat} 2021-11-28 04:03:45,084 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] Availability check failed for member 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:03:45,084 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] Requesting removal of suspect member 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:03:45,085 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Failure Detection thread 9] [] This member is becoming the membership coordinator with address 10.196.55.142(19002)<ec><v1>:42000 {noformat} It then realized that quorum had been lost (node1 was coordinator with weight=15; node2 was not coordinator with weight=10): {noformat} 2021-11-28 04:03:45,091 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] View Creator thread is starting 2021-11-28 04:03:45,091 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] 10.196.55.141(15661)<ec><v0>:42000 had a weight of 15 2021-11-28 04:03:45,092 WARN [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] total weight lost in this view change is 15 of 25. Quorum has been lost! 2021-11-28 04:03:45,092 FATAL [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] Possible loss of quorum due to the loss of 1 cache processes: [10.196.55.141(15661)<ec><v0>:42000] {noformat} And disconnected itself from the distributed system: {noformat} 2021-11-28 04:03:46,093 FATAL [org.apache.geode.distributed.internal.membership.gms.Services]-[Geode Membership View Creator] [] Membership service failure: Exiting due to possible network partition event due to loss of 1 cache processes: [10.196.55.141(15661)<ec><v0>:42000] org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Exiting due to possible network partition event due to loss of 1 cache processes: [10.196.55.141(15661)<ec><v0>:42000] at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:1787) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1122) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.access$1300(GMSJoinLeave.java:80) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.prepareAndSendView(GMSJoinLeave.java:2588) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.sendInitialView(GMSJoinLeave.java:2204) [geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave$ViewCreator.run(GMSJoinLeave.java:2286) [geode-membership-1.14.0.jar:?] {noformat} It stopped its locator: {noformat} 2021-11-28 04:03:46,794 INFO [org.apache.geode.distributed.internal.InternalLocator]-[ReconnectThread] [] Distribution Locator on vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142 is stopped{noformat} h3. Node2 Reconnect Attempt 1 {noformat} 2021-11-28 04:04:46,800 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Attempting to reconnect to the distributed system. This is attempt #1. {noformat} The first reconnect attempt failed to get quorum (it needed a weight of 13 but is 10): {noformat} 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] performing a quorum check to see if location services can be started early 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] beginning quorum check with GMSQuorumChecker on view View[10.196.55.141(15661)<ec><v0>:42000|1] members: [10.196.55.141(15661)<ec><v0>:42000{lead}, 10.196.55.142(19002)<ec><v1>:42000] 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: sending request to 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:04:46,810 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: sending request to 10.196.55.142(19002)<ec><v1>:42000 2021-11-28 04:04:46,812 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] received ping-pong response from /10.196.55.142<v1>:42000 2021-11-28 04:04:46,812 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: mapped address to member ID 10.196.55.142(19002)<ec><v1>:42000 2021-11-28 04:04:46,812 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: waiting up to 15000ms to receive a quorum of responses ... 2021-11-28 04:05:01,821 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: timeout waiting for responses. 1 responses received 2021-11-28 04:05:01,822 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: contacted 1 processes with 10 member weight units. Threshold for a quorum is 13 {noformat} Since quorum failed, the locator did not restart: {noformat} 2021-11-28 04:05:01,822 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] quorum check failed - not allowing location services to start early {noformat} It then was able to send a *FindCoordinatorRequest* and receive a *FindCoordinatorResponse* from 10.196.55.141:10335 (node1's locator). It then tries to join the distributed system which fails. It does this 4 times, but isn't able to join the distributed system: {noformat} 2021-11-28 04:05:02,078 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: java.net.ConnectException: Connection refused (Connection refused) 2021-11-28 04:05:06,946 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=1, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|1] members: [10.196.55.141(15661)<ec><v0>:42000{lead}]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:05:06,948 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:05:10,015 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:05:10,015 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[]; registrants=[]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=1; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|1] members: [10.196.55.141(15661)<ec><v0>:42000{lead}]; responses=[] 2021-11-28 04:05:10,016 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 *2021-11-28 04:05:10,016 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Attempting to join the distributed system through coordinator 10.196.55.141(15661)<ec><v0>:42000 using address 10.196.55.142(19002)<ec>:42000 2021-11-28 04:05:22,017 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: java.net.ConnectException: Connection refused (Connection refused) 2021-11-28 04:05:23,050 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:05:23,052 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:05:26,111 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:05:26,111 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[10.196.55.141(15661)<ec><v0>:42000]; registrants=[]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]; responses=[] 2021-11-28 04:05:26,113 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:26,113 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response 2021-11-28 04:05:38,113 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: java.net.ConnectException: Connection refused (Connection refused) 2021-11-28 04:05:39,145 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:05:39,147 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:05:42,207 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:05:42,207 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[10.196.55.141(15661)<ec><v0>:42000]; registrants=[10.196.55.142(19002)<ec>:42000]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]; responses=[] 2021-11-28 04:05:42,209 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:42,209 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response 2021-11-28 04:05:54,210 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Unable to contact locator HostAndPort[/10.196.55.142:10335]: java.net.ConnectException: Connection refused (Connection refused) 2021-11-28 04:05:55,242 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:05:55,244 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:05:58,303 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:05:58,303 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[10.196.55.141(15661)<ec><v0>:42000]; registrants=[10.196.55.142(19002)<ec>:42000]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]; responses=[] 2021-11-28 04:05:58,305 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:58,305 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response {noformat} * This message is coming from {*}GMSJoinLeave.attemptToJoin{*}. This fails since the ReconnectThread continues on. node1-vm.log contains these messages from *GMSLocator.processFindCoordinatorRequest* corresponding to the *FindCoordinatorRequests* received from node2: {noformat} 2021-11-28 04:05:06,898 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator request thread 1] [] Peer locator: coordinator from view is 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:23,045 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator request thread 1] [] Peer locator: coordinator from view is 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:39,141 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator request thread 1] [] Peer locator: coordinator from registrations is 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:05:55,238 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[locator request thread 1] [] Peer locator: coordinator from registrations is 10.196.55.141(15661)<ec><v0>:42000 {noformat} Since, node2 was unable to join the distributed system, it gave up: {noformat} 2021-11-28 04:06:10,309 WARN [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Caught SystemConnectException in reconnect org.apache.geode.SystemConnectException: Problem starting up membership services at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:186) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1811) ~[geode-membership-1.14.0.jar:?] Caused by: org.apache.geode.distributed.internal.membership.api.MemberStartupException: Unable to join the distributed system in 68228ms at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.join(GMSJoinLeave.java:411) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership.join(GMSMembership.java:533) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership.access$1200(GMSMembership.java:72) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.joinDistributedSystem(GMSMembership.java:1752) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:242) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1642) ~[geode-membership-1.14.0.jar:?] at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171) ~[geode-core-1.14.0.jar:?] {noformat} Node2 Reconnect Attempt 2 {noformat} 2021-11-28 04:07:10,310 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Attempting to reconnect to the distributed system. This is attempt #2.{noformat} The second reconnect attempt succeeded in getting quorum (this means node2 was able to send a ping message to node1 and receive a pong message back: {noformat} 2021-11-28 04:07:10,319 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] performing a quorum check to see if location services can be started early 2021-11-28 04:07:10,319 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] beginning quorum check with GMSQuorumChecker on view View[10.196.55.141(15661)<ec><v0>:42000|1] members: [10.196.55.141(15661)<ec><v0>:42000{lead}, 10.196.55.142(19002)<ec><v1>:42000] 2021-11-28 04:07:10,319 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: sending request to 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:07:10,320 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: waiting up to 15000ms to receive a quorum of responses 2021-11-28 04:07:10,320 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[unicast receiver,vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2-26663] [] received ping-pong response from /10.196.55.141<v0>:42000 2021-11-28 04:07:10,321 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[unicast receiver,vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2-26663] [] quorum check: mapped address to member ID 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:07:10,820 INFO [org.apache.geode.distributed.internal.membership.gms.messenger.GMSQuorumChecker]-[ReconnectThread] [] quorum check: received responses from all members that were in the old distributed system {noformat} It then started the locator: {noformat} 2021-11-28 04:07:10,820 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Quorum check passed - allowing location services to start early 2021-11-28 04:07:10,879 INFO [org.apache.geode.distributed.internal.InternalLocator]-[ReconnectThread] [] Starting peer location for Distribution Locator on vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142 2021-11-28 04:07:10,879 INFO [org.apache.geode.distributed.internal.tcpserver.TcpServer]-[ReconnectThread] [] Locator was created at Sun Nov 28 04:07:10 UTC 2021 2021-11-28 04:07:10,879 INFO [org.apache.geode.distributed.internal.tcpserver.TcpServer]-[ReconnectThread] [] Listening on port 10335 bound on address vmw-hcs-248e71fd-dd76-4111-ba82-379151aabbb7-3000-1-node-2/10.196.55.142 {noformat} And recovered state from node1's locator (this means node2 was able to send a *GetViewRequest* to node1 and receive a *GetViewResponse* back): {noformat} 2021-11-28 04:07:10,880 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread] [] GemFire peer location service starting. Other locators: 10.196.55.142[10335],10.196.55.141[10335],10.196.55.133[10335] Locators preferred as coordinators: true Network partition detection enabled: true View persistence file: /locator10335view.dat 2021-11-28 04:07:10,898 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread] [] Peer locator attempting to recover from HostAndPort[/10.196.55.141:10335] 2021-11-28 04:07:10,965 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread] [] Peer locator recovered initial membership of View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000] 2021-11-28 04:07:10,966 INFO [org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator]-[ReconnectThread] [] Peer locator recovered state from HostAndPort[/10.196.55.141:10335] {noformat} It then had the same behavior with joining the distributed system as reconnect attempt 1: {noformat} 2021-11-28 04:07:11,153 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=-100, registrants=[10.196.55.142(19002)<ec>:42000], senderId=10.196.55.142(19002)<ec>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|-100] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.142:10335] 2021-11-28 04:07:11,224 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] received FindCoordinatorResponse(coordinator=10.196.55.141(15661)<ec><v0>:42000, fromView=true, viewId=2, registrants=[10.196.55.142(19002)<ec>:42000, 10.196.55.142(19002)<ec>:42000], senderId=10.196.55.141(15661)<ec><v0>:42000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]) from locator HostAndPort[/10.196.55.141:10335] 2021-11-28 04:07:11,226 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join 2021-11-28 04:07:14,303 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] findCoordinator chose 10.196.55.141(15661)<ec><v0>:42000 out of these possible coordinators: [10.196.55.141(15661)<ec><v0>:42000] 2021-11-28 04:07:14,304 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Discovery state after looking for membership coordinator is locatorsContacted=2; findInViewResponses=0; alreadyTried=[]; registrants=[]; possibleCoordinator=10.196.55.141(15661)<ec><v0>:42000; viewId=2; hasContactedAJoinedLocator=true; view=View[10.196.55.141(15661)<ec><v0>:42000|2] members: [10.196.55.141(15661)<ec><v0>:42000{lead}] crashed: [10.196.55.142(19002)<ec><v1>:42000]; responses=[] 2021-11-28 04:07:14,305 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] found possible coordinator 10.196.55.141(15661)<ec><v0>:42000 2021-11-28 04:07:14,305 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Attempting to join the distributed system through coordinator 10.196.55.141(15661)<ec><v0>:42000 using address 10.196.55.142(19002)<ec>:42000 2021-11-28 04:07:29,409 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response 2021-11-28 04:07:44,516 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response 2021-11-28 04:07:59,617 INFO [org.apache.geode.distributed.internal.membership.gms.Services]-[ReconnectThread] [] Probable coordinator is still 10.196.55.141(15661)<ec><v0>:42000 - waiting for a join-response {noformat} Since, it was unable to join the distributed system, it gave up: {noformat} 2021-11-28 04:08:11,622 WARN [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Caught SystemConnectException in reconnect {noformat} *However, the locator was not stopped.* h3. Node2 Reconnect Attempt 3 {noformat} 2021-11-28 04:09:11,623 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Attempting to reconnect to the distributed system. This is attempt #3. {noformat} The third reconnect attempt also succeeded in getting quorum: {noformat} 2021-11-28 04:09:11,631 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] performing a quorum check to see if location services can be started early 2021-11-28 04:09:11,631 INFO [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Quorum check passed - allowing location services to start early {noformat} Then, the exception occurred trying to start the locator since it was still running from reconnect attempt 2: {noformat} 2021-11-28 04:09:11,632 WARN [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] [] Exception occurred while trying to connect the system during reconnect java.lang.IllegalStateException: A locator can not be created because one already exists in this JVM. at org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) ~[geode-core-1.14.0.jar:?] at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424) ~[geode-core-1.14.0.jar:?] {noformat} I'm not sure why the servers were able to exchange some messages (e.g. {*}FindCoordinatorRequest/Response{*}, {*}GetViewRequest/Response{*}) but not others (e.g. {*}JoinRequestMessage/JoinResponseMessage{*}), but that could be just the state of the servers at the time. In any event, if a reconnect attempt fails, any started locator should be stopped. > Failure to auto-reconnect upon network partition > ------------------------------------------------ > > Key: GEODE-9910 > URL: https://issues.apache.org/jira/browse/GEODE-9910 > Project: Geode > Issue Type: Bug > Affects Versions: 1.14.0 > Reporter: Surya Mudundi > Assignee: Barrett Oglesby > Priority: Major > Labels: GeodeOperationAPI, blocks-1.15.0, needsTriage > Attachments: geode-logs.zip > > > Two node cluster with embedded locators failed to auto-reconnect when node-1 > experienced network outage for couple of minutes and when node-1 recovered > from the outage, node-2 failed to auto-reconnect. > node-2 tried to re-connect to node-1 as: > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #1. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #2. > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Attempting to reconnect to the distributed system. This is attempt #3. > Finally reported below error after 3 attempts as: > INFO > [org.apache.geode.logging.internal.LoggingProviderLoader]-[ReconnectThread] > [] Using org.apache.geode.logging.internal.SimpleLoggingProvider for service > org.apache.geode.logging.internal.spi.LoggingProvider > INFO [org.apache.geode.internal.InternalDataSerializer]-[ReconnectThread] [] > initializing InternalDataSerializer with 0 services > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] performing a quorum check to see if location services can be started early > INFO > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Quorum check passed - allowing location services to start early > WARN > [org.apache.geode.distributed.internal.InternalDistributedSystem]-[ReconnectThread] > [] Exception occurred while trying to connect the system during reconnect > java.lang.IllegalStateException: A locator can not be created because one > already exists in this JVM. > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:298) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalLocator.createLocator(InternalLocator.java:273) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.startInitLocator(InternalDistributedSystem.java:916) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:768) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2326) > ~[geode-core-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1187) > ~[geode-membership-1.14.0.jar:?] > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1811) > ~[geode-membership-1.14.0.jar:?] > at java.lang.Thread.run(Thread.java:829) [?:?] > -- This message was sent by Atlassian Jira (v8.20.1#820001)