Lynn Gallinat created GEODE-9425:
------------------------------------

             Summary: AutoConnectionSource thread in client can't query for 
available locators when it is connected to a locator that was shut down
                 Key: GEODE-9425
                 URL: https://issues.apache.org/jira/browse/GEODE-9425
             Project: Geode
          Issue Type: Bug
          Components: client/server
    Affects Versions: 1.15.0
            Reporter: Lynn Gallinat


The AutoConnectionSource thread runs in a client and queries the locator the client is connected to so it can update the list of available locators. But if that locator was shut down, the client can't get an updated locator list. In this case the locator was shut down and is not coming back, but there is another available locator. However, we can't find out what that available locator is because we can't complete the query.

To summarize: The AutoConnectionSource thread that runs in a client to update 
the list of available locators should be able to get a list of available 
locators even when that client is connected to a locator that was shut down.

The AutoConnectionSource thread starts and runs every 10 seconds. This is from the client's system log:
{noformat}
[info 2021/07/07 19:37:33.723 GMT clientgemfire1_host1_881 <vm_0_thr_0_client1_host1_881> tid=0x2d] AutoConnectionSource UpdateLocatorListTask started with interval=10000 ms.
{noformat}

After the locator is shut down, the AutoConnectionSource thread can't complete its work, so we get stuck threads.

This stuck thread stack shows it is trying to run UpdateLocatorListTask.
{noformat}
clientgemfire1_881/system.log: [warn 2021/07/07 19:47:25.784 GMT 
clientgemfire1_host1_881 <ThreadsMonitor> tid=0x36] Thread <286> (0x11e) that 
was executed at <07 Jul 2021 19:46:03 GMT> has been stuck for <82.041 seconds> 
and number of thread monitor iteration <1>
Thread Name <poolTimer-pool-24> state <RUNNABLE>
Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack for "poolTimer-pool-24" (0x11e):
java.lang.ThreadState: RUNNABLE (in native)
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    - locked java.net.SocksSocketImpl@3e95a505
  at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:607)
  at 
org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
  at 
org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
  at 
org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96)
  at 
org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246)
  at 
org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151)
  at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217)
  at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207)
  at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254)
  at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.access$200(AutoConnectionSourceImpl.java:68)
  at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl$UpdateLocatorListTask.run2(AutoConnectionSourceImpl.java:458)
  at 
org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
  at 
org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:285)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
  - java.util.concurrent.ThreadPoolExecutor$Worker@24cd39b5
{noformat}
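The stack shows the thread blocked in a native TCP connect (PlainSocketImpl.socketConnect) to the locator that was shut down. As a purely illustrative sketch (the hostname and port below are placeholders, and this is plain JDK code, not Geode's), this is the kind of call that can stall for the full connect timeout when the remote side silently drops the connection attempt:
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class BlockingConnectDemo {
  public static void main(String[] args) {
    // Placeholder address standing in for a locator whose process (or pod) is gone.
    InetSocketAddress dead = new InetSocketAddress("locator-3.example.local", 10334);
    try (Socket s = new Socket()) {
      // If the remote end actively refuses, this fails fast with ConnectException.
      // If connection attempts are silently dropped (e.g. the pod behind the
      // address no longer exists), the thread sits here until the timeout
      // expires (59000 ms with Geode's default socket connect timeout).
      s.connect(dead, 59000);
    } catch (IOException e) {
      System.out.println("connect failed or timed out: " + e);
    }
  }
}
{code}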
Impact on running cache operations:
Any operations in progress by the client connected to a locator that was shut down can take 59 seconds to complete, which is the default socket connect timeout. From org.apache.geode.cache.client.PoolFactory:
{noformat}
int DEFAULT_SOCKET_CONNECT_TIMEOUT = 59000;
{noformat}

If the client's connection pool has multiple locators listed, then we can see a 59 second timeout for each one that has been shut down, so an operation could take minutes to complete. Here is a stack for such a timeout:
{noformat}
clientgemfire1_881/system.log: [info 2021/07/07 19:48:57.936 GMT 
clientgemfire1_host1_881 <queueTimer-pool1> tid=0x85] Communication with 
locator 
gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334
 failed
java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:607)
        at 
org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
        at 
org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
        at 
org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96)
        at 
org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246)
        at 
org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151)
        at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217)
        at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207)
        at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254)
        at 
org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServersForQueue(AutoConnectionSourceImpl.java:188)
        at 
org.apache.geode.cache.client.internal.QueueManagerImpl.findQueueServers(QueueManagerImpl.java:823)
        at 
org.apache.geode.cache.client.internal.QueueManagerImpl.recoverRedundancy(QueueManagerImpl.java:640)
        at 
org.apache.geode.cache.client.internal.QueueManagerImpl.access$700(QueueManagerImpl.java:77)
        at 
org.apache.geode.cache.client.internal.QueueManagerImpl$RedundancySatisfierTask.run2(QueueManagerImpl.java:1449)
        at 
org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
Then 59 seconds later there can be another timeout while the client tries to 
talk to a different locator that was shut down.
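
A possible mitigation, though not a fix for this issue, is to lower the pool's socket connect timeout so each attempt against a dead locator fails faster. A minimal sketch, assuming a ClientCache already exists in the JVM and using a placeholder locator address:
{code:java}
import org.apache.geode.cache.client.Pool;
import org.apache.geode.cache.client.PoolFactory;
import org.apache.geode.cache.client.PoolManager;

public class ShortConnectTimeoutPool {
  public static void main(String[] args) {
    // Assumes a ClientCache has already been created in this JVM.
    PoolFactory pf = PoolManager.createFactory()
        .addLocator("locator-0.example.local", 10334) // placeholder address
        // Default is PoolFactory.DEFAULT_SOCKET_CONNECT_TIMEOUT (59000 ms);
        // a smaller value bounds how long each connect to a dead locator can stall.
        .setSocketConnectTimeout(5000);
    Pool pool = pf.create("pool");
    // ... use the pool, then clean up.
    pool.destroy();
  }
}
{code}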

Notes on how to reproduce:
Need servers and at least 2 locators.
Need a client connected to a locator.
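
As a rough reproduction sketch (hostnames and ports are placeholders for the gemfire-cluster-locator-* addresses shown in the pool configuration below), the client can be set up with all locators in its pool:
{code:java}
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;

public class ReproClient {
  public static void main(String[] args) {
    // Placeholder locator addresses; the real test used 4 locators.
    ClientCache cache = new ClientCacheFactory()
        .addPoolLocator("locator-0.example.local", 10334)
        .addPoolLocator("locator-1.example.local", 10334)
        .addPoolLocator("locator-2.example.local", 10334)
        .addPoolLocator("locator-3.example.local", 10334)
        .setPoolSubscriptionEnabled(true)
        .setPoolSubscriptionRedundancy(1)
        .create();
    // Run cache operations, then shut down the locator the client is connected to
    // and watch the client's system.log for stuck UpdateLocatorListTask threads.
    cache.close();
  }
}
{code}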

This is the client's pool configuration from my example where there are 4 
locators to start with, and all are running.
{noformat}
[info 2021/07/07 19:37:33.715 GMT <vm_0_thr_0_client1_host1_881> tid=0x2d] 
Creating pool named: pool with attributes
    freeConnectionTimeout: 10000
    idleTimeout: 5000
    loadConditioningInterval: 5000
    locators: 
[gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.17.239:10334,
 
gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.22.157:10334,
 
gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.10.237:10334,
 
gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334]
    minConnections: 2
    maxConnections: -1
    multiuserAuthentication: false
    pingInterval: 10000
    poolName: pool
    prSingleHopEnabled: true
    readTimeout: 3600000
    retryAttempts: -1
    servers: []
    serverGroup:
    socketBufferSize: 32768
    statisticInterval: -1
    subscriptionAckInterval: 100
    subscriptionEnabled: true
    subscriptionMessageTrackingTimeout: 900000
    subscriptionRedundancy: 1
{noformat}
After a client connects to the locator, that locator is shut down.

In this example, 3 locators were shut down, so the only one left was gemfire-cluster-locator-0. Note that the membership view in the system log for gemfire-cluster-locator-0 shows that it knows there is now only one locator.
{noformat}
[info 2021/07/07 19:38:04.420 GMT gemfire-cluster-locator-0 <Geode Membership 
View Creator> tid=0xc8] sending new view 
View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000|16]
 members: 
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000, 
gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v3>:41000{lead}, 
gemfire-cluster-server-5(gemfire-cluster-server-5:1)<v4>:41000, 
gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v6>:41000, 
gemfire-cluster-server-4(gemfire-cluster-server-4:1)<v7>:41000, 
gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v8>:41000, 
gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v9>:41000] shutdown: 
[gemfire-cluster-server-6(gemfire-cluster-server-6:1)<v5>:41000]
{noformat}
After an appropriate amount of time the stuck thread stack appears in the 
client's system.log.

If operations were in progress in that client, they time out after 59 seconds and try another locator, possibly another stopped one, and then wait another 59 seconds.

This can repeat for each locator that was shut down, making an operation take minutes. For example, with 3 stopped locators that is up to roughly 3 x 59 = 177 seconds before the operation completes.


