Lynn Gallinat created GEODE-9425:
------------------------------------

             Summary: AutoConnectionSource thread in client can't query for available locators when it is connected to a locator that was shut down
                 Key: GEODE-9425
                 URL: https://issues.apache.org/jira/browse/GEODE-9425
             Project: Geode
          Issue Type: Bug
          Components: client/server
    Affects Versions: 1.15.0
            Reporter: Lynn Gallinat
The AutoConnectionSource thread runs in a client and queries the locator the client is connected to in order to update the list of available locators. But if that locator has been shut down, the client can't get an updated locator list. In this case the locator was shut down and is not coming back, yet another locator is available; the client cannot discover it because the query never completes.

To summarize: the AutoConnectionSource thread that runs in a client to update the list of available locators should be able to obtain that list even when the client is connected to a locator that was shut down.

The AutoConnectionSource thread starts and then runs every 10 seconds. This is from the client's system log:

[info 2021/07/07 19:37:33.723 GMT clientgemfire1_host1_881 <vm_0_thr_0_client1_host1_881> tid=0x2d] AutoConnectionSource UpdateLocatorListTask started with interval=10000 ms.

After the locator is shut down, the AutoConnectionSource thread can't complete its work, so we get stuck threads. This stuck thread stack shows it is trying to run UpdateLocatorListTask.
{noformat}
clientgemfire1_881/system.log:
[warn 2021/07/07 19:47:25.784 GMT clientgemfire1_host1_881 <ThreadsMonitor> tid=0x36] Thread <286> (0x11e) that was executed at <07 Jul 2021 19:46:03 GMT> has been stuck for <82.041 seconds> and number of thread monitor iteration <1>
Thread Name <poolTimer-pool-24> state <RUNNABLE>
Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack for "poolTimer-pool-24" (0x11e):
java.lang.ThreadState: RUNNABLE (in native)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	- locked java.net.SocksSocketImpl@3e95a505
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
	at org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
	at org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96)
	at org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246)
	at org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151)
	at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217)
	at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207)
	at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254)
	at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.access$200(AutoConnectionSourceImpl.java:68)
	at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl$UpdateLocatorListTask.run2(AutoConnectionSourceImpl.java:458)
	at org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:285)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
	- java.util.concurrent.ThreadPoolExecutor$Worker@24cd39b5
{noformat}

Impact on running cache operations:

Any operation in progress on a client connected to a locator that was shut down can take 59 seconds to complete, which is the default socket connect timeout. From org.apache.geode.cache.client.PoolFactory: int DEFAULT_SOCKET_CONNECT_TIMEOUT = 59000;

If the client's connection pool lists multiple locators, we can see a 59 second timeout for each one that has been shut down, so an operation could take minutes to complete.
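The cumulative cost of these timeouts is simple arithmetic, and it is what turns a locator shutdown into multi-minute operations. Here is a minimal sketch of the math in plain Java; the shorter-timeout mitigation mentioned in the comments assumes the standard PoolFactory.setSocketConnectTimeout setter and is only a workaround idea, not something verified in this report:

```java
public class LocatorTimeoutMath {
    // Default socket connect timeout, from org.apache.geode.cache.client.PoolFactory
    // (quoted above in this report).
    static final int DEFAULT_SOCKET_CONNECT_TIMEOUT_MS = 59_000;

    // Worst-case extra delay an operation can see while the client walks the
    // pool's locator list: one full connect timeout per stopped locator.
    static long worstCaseDelayMs(int stoppedLocators, int socketConnectTimeoutMs) {
        return (long) stoppedLocators * socketConnectTimeoutMs;
    }

    public static void main(String[] args) {
        // 3 of the 4 locators in the example pool below were shut down:
        System.out.println(worstCaseDelayMs(3, DEFAULT_SOCKET_CONNECT_TIMEOUT_MS)); // 177000 ms, ~3 minutes
        // A pool built with a shorter connect timeout (assuming
        // PoolFactory.setSocketConnectTimeout(5000)) bounds each wait at 5 s:
        System.out.println(worstCaseDelayMs(3, 5_000)); // 15000 ms
    }
}
```

Note that shrinking the connect timeout only bounds each per-locator wait; it does not let the AutoConnectionSource thread learn about the surviving locator, which is the bug reported here.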
Here is a stack for such a timeout:

{noformat}
clientgemfire1_881/system.log:
[info 2021/07/07 19:48:57.936 GMT clientgemfire1_host1_881 <queueTimer-pool1> tid=0x85] Communication with locator gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334 failed
java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
	at org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
	at org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96)
	at org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246)
	at org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151)
	at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217)
	at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207)
	at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254)
	at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServersForQueue(AutoConnectionSourceImpl.java:188)
	at org.apache.geode.cache.client.internal.QueueManagerImpl.findQueueServers(QueueManagerImpl.java:823)
	at org.apache.geode.cache.client.internal.QueueManagerImpl.recoverRedundancy(QueueManagerImpl.java:640)
	at org.apache.geode.cache.client.internal.QueueManagerImpl.access$700(QueueManagerImpl.java:77)
	at org.apache.geode.cache.client.internal.QueueManagerImpl$RedundancySatisfierTask.run2(QueueManagerImpl.java:1449)
	at org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}

Then 59 seconds later there can be another timeout while the client tries to talk to a different locator that was shut down.

Notes on how to reproduce:

This requires servers and at least 2 locators, with a client connected to a locator. This is the client's pool configuration from my example, where there are 4 locators to start with and all are running.
{noformat}
[info 2021/07/07 19:37:33.715 GMT <vm_0_thr_0_client1_host1_881> tid=0x2d] Creating pool named: pool with attributes
freeConnectionTimeout: 10000
idleTimeout: 5000
loadConditioningInterval: 5000
locators: [gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.17.239:10334, gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.22.157:10334, gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.10.237:10334, gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334]
minConnections: 2
maxConnections: -1
multiuserAuthentication: false
pingInterval: 10000
poolName: pool
prSingleHopEnabled: true
readTimeout: 3600000
retryAttempts: -1
servers: []
serverGroup:
socketBufferSize: 32768
statisticInterval: -1
subscriptionAckInterval: 100
subscriptionEnabled: true
subscriptionMessageTrackingTimeout: 900000
subscriptionRedundancy: 1
{noformat}

After a client connects to a locator, that locator is shut down. In this example, 3 locators were shut down, so the only one left was gemfire-cluster-locator-0. Note that the view in the system log for gemfire-cluster-locator-0 shows that it knows there is now only one locator.
{noformat}
[info 2021/07/07 19:38:04.420 GMT gemfire-cluster-locator-0 <Geode Membership View Creator> tid=0xc8] sending new view View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000|16] members:
[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000,
 gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v3>:41000{lead},
 gemfire-cluster-server-5(gemfire-cluster-server-5:1)<v4>:41000,
 gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v6>:41000,
 gemfire-cluster-server-4(gemfire-cluster-server-4:1)<v7>:41000,
 gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v8>:41000,
 gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v9>:41000]
shutdown: [gemfire-cluster-server-6(gemfire-cluster-server-6:1)<v5>:41000]
{noformat}

After an appropriate amount of time, the stuck thread stack appears in the client's system.log. If operations were in progress in that client, they time out after 59 seconds and try another locator, possibly another stopped one, and then wait another 59 seconds. This can repeat for each locator that was shut down, making an operation take minutes.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)