[ https://issues.apache.org/jira/browse/GEODE-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Hanson reassigned GEODE-9425: ---------------------------------- Assignee: (was: Mark Hanson) > AutoConnectionSource thread in client can't query for available locators when > it is connected to a locator that was shut down > ----------------------------------------------------------------------------------------------------------------------------- > > Key: GEODE-9425 > URL: https://issues.apache.org/jira/browse/GEODE-9425 > Project: Geode > Issue Type: Bug > Components: client/server > Affects Versions: 1.15.0 > Reporter: Lynn Gallinat > Priority: Major > > The AutoConnectionSource thread runs in a client and queries the locator that > client is connected to so it can update the list of available locators. > But if the locator the client is connected to was shut down, the client > can't get an updated locator list. > In this case the locator was shut down and is not coming back, but there is > another available locator. > However we can't find out what that available locator is because we can't > complete the query. > To summarize: The AutoConnectionSource thread that runs in a client to update > the list of available locators should be able to get a list of available > locators even when that client is connected to a locator that was shut down. > The AutoConnectionSource thread starts and runs every 10 seconds. This is > from the client's system log. > [info 2021/07/07 19:37:33.723 GMT clientgemfire1_host1_881 > <vm_0_thr_0_client1_host1_881> tid=0x2d] AutoConnectionSource > UpdateLocatorListTask started with interval=10000 ms. > After the locator is shut down the AutoConnectionSource thread can't complete > its work so we get stuck threads. > This stuck thread stack shows it is trying to run UpdateLocatorListTask. > {noformat} > clientgemfire1_881/system.log: [warn 2021/07/07 19:47:25.784 GMT > clientgemfire1_host1_881 <ThreadsMonitor> tid=0x36] Thread <286> (0x11e) that > was executed at <07 Jul 2021 19:46:03 GMT> has been stuck for <82.041 > seconds> and number of thread monitor iteration <1> > Thread Name <poolTimer-pool-24> state <RUNNABLE> > Executor Group <ScheduledThreadPoolExecutorWithKeepAlive> > Monitored metric <ResourceManagerStats.numThreadsStuck> > Thread stack for "poolTimer-pool-24" (0x11e): > java.lang.ThreadState: RUNNABLE (in native) > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > - locked java.net.SocksSocketImpl@3e95a505 > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:607) > at > org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102) > at > org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51) > at > org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96) > at > org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246) > at > org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151) > at > org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217) > at > org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207) > at > org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254) > at > org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.access$200(AutoConnectionSourceImpl.java:68) > at > org.apache.geode.cache.client.internal.AutoConnectionSourceImpl$UpdateLocatorListTask.run2(AutoConnectionSourceImpl.java:458) > at > org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:285) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Locked ownable synchronizers: > - java.util.concurrent.ThreadPoolExecutor$Worker@24cd39b5 > {noformat} > Impact on running cache operations: > Any operations in progress by the client connected to a locator that was > shut down can take 59 seconds to complete, which is the default socket > connect timeout. > From org.apache.geode.cache.client.PoolFactory: > int DEFAULT_SOCKET_CONNECT_TIMEOUT = 59000; > If the client's connection pool has multiple locators listed, then we can see > a 59 second timeout for each of those that has been shut down so an operation > could take minutes to complete. > Here is a stack for such a timeout: > {noformat} > clientgemfire1_881/system.log: [info 2021/07/07 19:48:57.936 GMT > clientgemfire1_host1_881 <queueTimer-pool1> tid=0x85] Communication with > locator > gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334 > failed > java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:607) > at > org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102) > at > org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51) > at > org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96) > at > org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246) > at > org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151) > at > org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217) > at > org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207) > at > org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254) > at > org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServersForQueue(AutoConnectionSourceImpl.java:188) > at > org.apache.geode.cache.client.internal.QueueManagerImpl.findQueueServers(QueueManagerImpl.java:823) > at > org.apache.geode.cache.client.internal.QueueManagerImpl.recoverRedundancy(QueueManagerImpl.java:640) > at > org.apache.geode.cache.client.internal.QueueManagerImpl.access$700(QueueManagerImpl.java:77) > at > org.apache.geode.cache.client.internal.QueueManagerImpl$RedundancySatisfierTask.run2(QueueManagerImpl.java:1449) > at > org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} > Then 59 seconds later there can be another timeout while the client tries to > talk to a different locator that was shut down. > Notes on how to reproduce: > Need servers and at least 2 locators. > Need a client connected to a locator. > This is the client's pool configuration from my example where there are 4 > locators to start with, and all are running. > {noformat} > [info 2021/07/07 19:37:33.715 GMT <vm_0_thr_0_client1_host1_881> tid=0x2d] > Creating pool named: pool with attributes > freeConnectionTimeout: 10000 > idleTimeout: 5000 > loadConditioningInterval: 5000 > locators: > [gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.17.239:10334, > > gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.22.157:10334, > > gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.10.237:10334, > > gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334] > minConnections: 2 > maxConnections: -1 > multiuserAuthentication: false > pingInterval: 10000 > poolName: pool > prSingleHopEnabled: true > readTimeout: 3600000 > retryAttempts: -1 > servers: [] > serverGroup: > socketBufferSize: 32768 > statisticInterval: -1 > subscriptionAckInterval: 100 > subscriptionEnabled: true > subscriptionMessageTrackingTimeout: 900000 > subscriptionRedundancy: 1 > {noformat} > After a client connects to the locator, that locator is shut down. > In this example, 3 locators were shutdown so the only one left was > gemfire-cluster-locator-0. > Note that the view in the system log for gemfire-cluster-locator-0 shows > that it knows there is now only one locator. > {noformat} > [info 2021/07/07 19:38:04.420 GMT gemfire-cluster-locator-0 <Geode Membership > View Creator> tid=0xc8] sending new view > View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000|16] > members: > [gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000, > gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v3>:41000{lead}, > gemfire-cluster-server-5(gemfire-cluster-server-5:1)<v4>:41000, > gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v6>:41000, > gemfire-cluster-server-4(gemfire-cluster-server-4:1)<v7>:41000, > gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v8>:41000, > gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v9>:41000] shutdown: > [gemfire-cluster-server-6(gemfire-cluster-server-6:1)<v5>:41000] > {noformat} > After an appropriate amount of time the stuck thread stack appears in the > client's system.log. > If operations were in progress in that client, they timeout after 59 seconds > and try another locator, possibly another stopped locator and then it waits > another 59 seconds. > This can repeat for each locator that was shut down making an operation take > minutes. -- This message was sent by Atlassian Jira (v8.20.1#820001)