[ 
https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526130#comment-17526130
 ] 

Bill Burcham commented on GEODE-9402:
-------------------------------------

Here’s a draft PR with my experiments: 
[https://github.com/apache/geode/pull/7614]

(In my testing I enabled TLS for all components. I don’t think it matters for 
this ticket but it’s become a habit.)

I wrote a test that starts a three-member cluster and then binds a server 
socket to port X and then calls geode.cache.Cache.addCacheServer() to create a 
CacheServer and then calls setPort(X) on it and then start(). Here’s the 
exception I get:

{{BGB caught: java.net.BindException: Address already in use (Bind failed)
    at java.net.PlainSocketImpl.socketBind(Native Method)
    at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
    at java.net.ServerSocket.bind(ServerSocket.java:390)
    at 
org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:79)
    at 
org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:491)
    at 
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:574)
    at 
org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
    at 
org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:421)
    at 
org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:378)
    at 
org.apache.geode.cache30.ReconnectWithTlsAndClientsCacheServerDistributedTest.startClientsCacheServer(ReconnectWithTlsAndClientsCacheServerDistributedTest.java:126)
    at 
org.apache.geode.cache30.ReconnectWithTlsAndClientsCacheServerDistributedTest.disconnectAndReconnectTest(ReconnectWithTlsAndClientsCacheServerDistributedTest.java:105)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}}

Part of that stack trace, from the exception to CacheServerImpl.start matches 
the stack trace from GEM-3359. The test does not create the cache from cache 
XML (e.g. ClusterConfigurationLoader.applyClusterXmlConfiguration()) as 
described in the ticket however. *This may be an area we want to explore 
further.*

By explicitly causing the bind exception (in my new 
preBindToClientsCacheServerPortTest() test) I can see that the AcceptorImpl 
constructor is retrying when it encounters the BindException (a 
SocketException). It’ll repeatedly try to create the server socket for 120 
seconds (CacheServerImpl.getTimeLimitMillis()), sleeping 1 second in between 
tries. This is also true of the code path described by the stack trace in the 
ticket.

Calling ServerSocket.setReuseAddress(true) when I bind to port X, does not 
eliminate the bind exception. From the documentation:

Enabling SO_REUSEADDR prior to binding the socket using bind(SocketAddress) 
allows the socket to be bound even though a previous connection is in a timeout 
state.

This setting only allows something else to bind to the port when the original 
socket is in the timeout state. A socket not in the timeout state, bound to a 
port, simply monopolizes that port. The short of it is that 
setReuseAddress(true) is helpful for addressing certain race conditions but it 
can’t address them all.

I did confirm that Geode does always call setReuseAddress(true) whenever 
creating a server socket for a SocketCreator:

non-TLS case:

SocketCreator.createServerSocket()

TLS case:

SCClusterSocketCreator.createServerSocket()

I’ve got a test (disconnectAndReconnectTest()) that enables TLS for all Geode 
components (including clients) and creates a three-member cluster. Then it 
repeatedly starts a client’s CacheServer (bound to port X), crashes the 
distributed system via MembershipManagerHelper.crashDistributedSystem() and 
verifies that the disconnected member reconnects. I haven’t been able to 
reproduce the problem with this test.

This is not exactly the way the forced-disconnect was generated in GEM-3359. In 
that case a network partition caused the forced-disconnection. *This may be an 
area we want to explore further.*

Searching for asynchrony that could lead to a race condition I took a look at 
GMSMembership.ManagerImpl.forceDisconnect(). When that calls 
uncleanShutdownDS() a thread is spawned to do the actual work of shutting down 
the distributed system. Inserting at 30 second delay at the start of that 
thread’s task (run()) did not reproduce GEM-3359.

The path from uncleanShutdownDS() that actually leads to closing the client’s 
CacheServer’s ServerSocket can be seen in this stack trace:

{{BGB in AcceptorImpl.close() closing server socket bound to port: 20009, 
java.lang.Throwable
    at 
org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.close(AcceptorImpl.java:1617)
    at 
org.apache.geode.internal.cache.CacheServerImpl.stop(CacheServerImpl.java:485)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.stopServers(GemFireCacheImpl.java:2671)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.doClose(GemFireCacheImpl.java:2261)
    at 
org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2149)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1545)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2552)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2415)
    at 
org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1261)
    at 
org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2329)
    at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1190)
    at 
org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$uncleanShutdownDS$0(GMSMembership.java:1800)
    at java.lang.Thread.run(Thread.java:748)}}

Once the async task spawned by uncleanShutdownDS() runs, there is no other 
asynchrony involved. The server socket gets closed before the reconnect attempt 
is made.

All this makes me wonder if something outside of Geode is binding to the port 
between AcceptorImpl.close() and the subsequent attempt to bind a new socket to 
that port in CacheServer.start().

> Automatic Reconnect Failure: Address already in use
> ---------------------------------------------------
>
>                 Key: GEODE-9402
>                 URL: https://issues.apache.org/jira/browse/GEODE-9402
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Juan Ramos
>            Assignee: Bill Burcham
>            Priority: Major
>         Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
>
> There are 2 locators and 4 servers during the test, once they're all up and 
> running the test drops the network connectivity between all members to 
> generate a full network partition and cause all members to shutdown and go 
> into reconnect mode. Upon reaching the mentioned state, the test 
> automatically restores the network connectivity and expects all members to 
> automatically go up again and re-form the distributed system.
>  This works fine most of the time, and we see every member successfully 
> reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 <ReconnectThread> 
> tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 <ReconnectThread> 
> tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 <ReconnectThread> 
> tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 <ReconnectThread> 
> tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 <ReconnectThread> 
> tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 <ReconnectThread> 
> tid=0x95] Reconnect completed.
> {noformat}
> In some rare occasions, though, one of the servers fails during the reconnect 
> phase with the following exception:
> {noformat}
> [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 <ReconnectThread> 
> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = 
> false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server 
> = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed 
> because:
> org.apache.geode.GemFireIOException: While starting cache server CacheServer 
> on port=40404 client subscription config policy=none client subscription 
> config capacity=1 client subscription config overflow directory=.
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
>       at 
> org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
>       at 
> org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
>       at 
> org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
>       at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
>       at 
> org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
>       at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.net.BindException: Address already in use (Bind failed)
>       at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
>       at 
> java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
>       at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
>       at 
> org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
>       at 
> org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
>       at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
>       at 
> org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
>       at 
> org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
>       at 
> org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
>       at 
> org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
>       ... 14 more
> {noformat}
> It seems that the server is trying to bind the port before the old instance 
> has finished shutting down and cleaning up resources, causing the reconnect 
> process to halt and stop re-trying, and leaving the cluster with one less 
> member.
> We've been able to reproduce the problem only twice in the past few weeks, 
> I've attached the two set of artefacts to the ticket:
>  - _*cluster_logs_pks_121*_: the member that throws the {{BindException}} 
> during reconnect is {{gemfire-cluster-server-1}}.
>  - _*cluster_logs_gke_latest_54*_: the member that throws the 
> {{BindException}} during reconnect is {{gemfire-cluster-server-0}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to