[ https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bill Burcham reassigned GEODE-9402:
-----------------------------------

    Assignee: Jianxia Chen  (was: Bill Burcham)

> Automatic Reconnect Failure: Address already in use
> ---------------------------------------------------
>
>                 Key: GEODE-9402
>                 URL: https://issues.apache.org/jira/browse/GEODE-9402
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Juan Ramos
>            Assignee: Jianxia Chen
>            Priority: Major
>         Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
>
> There are 2 locators and 4 servers during the test. Once they are all up and
> running, the test drops the network connectivity between all members to
> generate a full network partition, causing all members to shut down and go
> into reconnect mode. Upon reaching that state, the test automatically
> restores the network connectivity and expects all members to come back up
> and re-form the distributed system.
> This works fine most of the time, and we see every member successfully
> reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 <ReconnectThread> tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 <ReconnectThread> tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 <ReconnectThread> tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 <ReconnectThread> tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 <ReconnectThread> tid=0x95] Reconnect completed.
> {noformat}
> On rare occasions, though, one of the servers fails during the reconnect
> phase with the following exception:
> {noformat}
> [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 <ReconnectThread> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because:
> org.apache.geode.GemFireIOException: While starting cache server CacheServer on port=40404 client subscription config policy=none client subscription config capacity=1 client subscription config overflow directory=.
>       at org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
>       at org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
>       at org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
>       at org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
>       at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
>       at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
>       at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
>       at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
>       at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>       at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
>       at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.net.BindException: Address already in use (Bind failed)
>       at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
>       at java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
>       at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
>       at org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
>       at org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
>       at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
>       at org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
>       at org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
>       at org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
>       at org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
>       ... 14 more
> {noformat}
> It seems that the server tries to bind the port before the old instance has
> finished shutting down and cleaning up its resources. The reconnect process
> then halts and stops retrying, leaving the cluster with one less member.
> We've been able to reproduce the problem only twice in the past few weeks.
> I've attached the two sets of artefacts to the ticket:
> - _*cluster_logs_pks_121*_: the member that throws the {{BindException}} during reconnect is {{gemfire-cluster-server-1}}.
> - _*cluster_logs_gke_latest_54*_: the member that throws the {{BindException}} during reconnect is {{gemfire-cluster-server-0}}.
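
The description suggests that the cache server tries to bind port 40404 before the previous instance has released it. One possible mitigation is to retry the bind for a bounded number of attempts instead of failing on the first BindException. The minimal sketch below illustrates that idea. It is not Geode code and not the fix adopted for this ticket: BindRetrySketch, bindWithRetry, and its parameters are hypothetical names, and only the standard java.net API is used.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical illustration only; none of these names exist in Geode.
public class BindRetrySketch {

    /**
     * Tries to bind a server socket to the given port, retrying when the
     * port is still in use (for example, held by an instance that has not
     * finished shutting down).
     *
     * @param port        port to bind on the wildcard address
     * @param maxAttempts total number of bind attempts, must be at least 1
     * @param delayMillis pause between failed attempts
     * @return a bound ServerSocket; the caller is responsible for closing it
     * @throws BindException if the port is still in use after the last attempt
     */
    public static ServerSocket bindWithRetry(int port, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        BindException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ServerSocket socket = new ServerSocket(); // created unbound
            try {
                socket.bind(new InetSocketAddress(port));
                return socket;
            } catch (BindException e) {
                // Port still taken: release this socket and try again later.
                socket.close();
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            } catch (IOException e) {
                // Any other I/O failure is not retried.
                socket.close();
                throw e;
            }
        }
        throw lastFailure;
    }
}
```

A caller could use something like BindRetrySketch.bindWithRetry(40404, 30, 1000) to wait up to roughly 30 seconds for the old listener to go away. The attempt count and delay here are arbitrary. A bounded retry keeps a truly conflicting process from stalling the reconnect forever.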