[
https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529123#comment-17529123
]
Bill Burcham commented on GEODE-9402:
-------------------------------------
A shortcoming of my testing around this problem is that my new test/experiment
isn't starting the cache server from cache XML. I notice we have a test for
that scenario: ReconnectWithCacheXMLDUnitTest and that test mentions
GEODE-2732. If you look at that ticket you'll see a BindException. It's got me
thinking perhaps a problem (this problem) remains when reconnecting from a
server started with cache XML.
> Automatic Reconnect Failure: Address already in use
> ---------------------------------------------------
>
> Key: GEODE-9402
> URL: https://issues.apache.org/jira/browse/GEODE-9402
> Project: Geode
> Issue Type: Bug
> Components: membership
> Reporter: Juan Ramos
> Assignee: Bill Burcham
> Priority: Major
> Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
>
> There are 2 locators and 4 servers during the test, once they're all up and
> running the test drops the network connectivity between all members to
> generate a full network partition and cause all members to shutdown and go
> into reconnect mode. Upon reaching the mentioned state, the test
> automatically restores the network connectivity and expects all members to
> automatically go up again and re-form the distributed system.
> This works fine most of the time, and we see every member successfully
> reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 <ReconnectThread>
> tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 <ReconnectThread>
> tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 <ReconnectThread>
> tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 <ReconnectThread>
> tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 <ReconnectThread>
> tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 <ReconnectThread>
> tid=0x95] Reconnect completed.
> {noformat}
> In some rare occasions, though, one of the servers fails during the reconnect
> phase with the following exception:
> {noformat}
> [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 <ReconnectThread>
> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing =
> false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server
> = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed
> because:
> org.apache.geode.GemFireIOException: While starting cache server CacheServer
> on port=40404 client subscription config policy=none client subscription
> config capacity=1 client subscription config overflow directory=.
> at
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
> at
> org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
> at
> org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
> at
> org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
> at
> org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
> at
> org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
> at
> org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
> at
> org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
> at
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
> at
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
> at
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
> at
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
> at
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
> at
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.net.BindException: Address already in use (Bind failed)
> at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
> at
> java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
> at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
> at
> org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
> at
> org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
> at
> org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
> at
> org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
> at
> org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
> at
> org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
> at
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
> ... 14 more
> {noformat}
> It seems that the server is trying to bind the port before the old instance
> has finished shutting down and cleaning up resources, causing the reconnect
> process to halt and stop re-trying, and leaving the cluster with one less
> member.
> We've been able to reproduce the problem only twice in the past few weeks,
> I've attached the two set of artefacts to the ticket:
> - _*cluster_logs_pks_121*_: the member that throws the {{BindException}}
> during reconnect is {{gemfire-cluster-server-1}}.
> - _*cluster_logs_gke_latest_54*_: the member that throws the
> {{BindException}} during reconnect is {{gemfire-cluster-server-0}}.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)