Bill Burcham created GEODE-9808:
-----------------------------------
Summary: Client ops fail with NoLocatorsAvailableException when
all servers leave the DS
Key: GEODE-9808
URL: https://issues.apache.org/jira/browse/GEODE-9808
Project: Geode
Issue Type: Bug
Components: client/server
Affects Versions: 1.15.0
Reporter: Bill Burcham
When there are no cache servers (only locators) in a cluster, client operations
will fail with a misleading exception:
{noformat}
org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect
to any locators in the list
[gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334,
gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334,
gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334]
at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:174)
at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:211)
at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:196)
at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:227)
at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.exchangeConnection(ConnectionManagerImpl.java:365)
at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:161)
at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:120)
at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:805)
at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91)
{noformat}
Even though the client is able to connect to a locator, we encounter a
NoAvailableLocatorsException with the message "Unable to connect to any
locators in the list".
Investigating the product code, we see:
# If there are no cache servers in the cluster, ServerLocator.pickServer()
constructs a ClientConnectionResponse(null), whose hasResult() returns false
in the loop-termination check in AutoConnectionSourceImpl.queryLocators().
# The exception wording is misleading not only in
AutoConnectionSourceImpl.findServer() but also in at least two other calling
locations in AutoConnectionSourceImpl: findReplacementServer() and
findServersForQueue().
# In each of those cases, the calling method translates a null response from
queryLocators() into a throw of a NoAvailableLocatorsException.
# An appropriate exception, NoAvailableServersException, already exists for
the case where we were able to contact a locator but the locator was not able
to find any cache servers.
# According to my Git spelunking, queryLocators() has been obfuscating the true
cause of the failure since at least 2015.
Without analyzing ServerLocator.pickServer()
(LocatorLoadSnapshot.getServerForConnection()) to discern why two locators
might disagree on how many cache servers are in the cluster, it seems to me
that we should modify AutoConnectionSourceImpl.queryLocators() so that:
* if it gets a ServerLocationResponse with hasResult() true, it immediately
returns that, as it does now
* otherwise it keeps trying and keeps track of the last non-null
ServerLocationResponse it has received
* once it has tried every locator, it returns the last non-null
ServerLocationResponse it received, or null if there was none
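The loop described above could look roughly like the following. This is a minimal, self-contained sketch, not the actual AutoConnectionSourceImpl code: the Response and ConnectionResponse types are simplified stand-ins for ServerLocationResponse and ClientConnectionResponse, and the query function (which returns null for an unreachable locator) is an assumption made for illustration.

```java
import java.util.List;
import java.util.function.Function;

final class QueryLocatorsSketch {

    /** Hypothetical stand-in for Geode's ServerLocationResponse. */
    interface Response {
        boolean hasResult();
    }

    /** Simplified response; a null server models "locator knows of no cache servers". */
    static final class ConnectionResponse implements Response {
        private final String server;

        ConnectionResponse(String server) {
            this.server = server;
        }

        @Override
        public boolean hasResult() {
            return server != null;
        }
    }

    /**
     * Asks each locator in turn. In this sketch the query function returns
     * null when a locator cannot be reached (an assumption, not Geode's API).
     */
    static <L> Response queryLocators(List<L> locators, Function<L, Response> query) {
        Response lastResponse = null;
        for (L locator : locators) {
            Response response = query.apply(locator);
            if (response == null) {
                continue; // locator did not respond: try the next one
            }
            if (response.hasResult()) {
                return response; // usable answer: return immediately, as today
            }
            lastResponse = response; // locator responded but had no server to offer
        }
        // Null only when no locator responded at all.
        return lastResponse;
    }
}
```

With this shape, a null return means "no locator responded", while a non-null response with hasResult() false means "a locator responded but knows of no cache servers", so callers can tell the two cases apart.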
With that in hand, we can change the three call locations in
AutoConnectionSourceImpl (findServer(), findReplacementServer(), and
findServersForQueue()) to each throw NoAvailableLocatorsException if no locator
responded, or NoAvailableServersException if a locator responded with a
ClientConnectionResponse for which hasResult() returns false.
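To illustrate the call-site change, here is a hedged sketch of how one caller might translate the result of queryLocators(). The exception classes below are local stand-ins named after the Geode exceptions so that the example compiles on its own; the Response class, the simplified findServer() signature, and the wording of the NoAvailableServersException message are assumptions for illustration only.

```java
final class LocatorResponseTranslationSketch {

    /** Local stand-in named after Geode's NoAvailableLocatorsException. */
    static final class NoAvailableLocatorsException extends RuntimeException {
        NoAvailableLocatorsException(String message) {
            super(message);
        }
    }

    /** Local stand-in named after Geode's NoAvailableServersException. */
    static final class NoAvailableServersException extends RuntimeException {
        NoAvailableServersException(String message) {
            super(message);
        }
    }

    /** Simplified response; a null server means the locator found no cache servers. */
    static final class Response {
        final String server;

        Response(String server) {
            this.server = server;
        }

        boolean hasResult() {
            return server != null;
        }
    }

    /** Hypothetical, simplified version of the findServer() call site. */
    static String findServer(Response response) {
        if (response == null) {
            // No locator responded at all.
            throw new NoAvailableLocatorsException(
                "Unable to connect to any locators in the list");
        }
        if (!response.hasResult()) {
            // A locator responded but had no cache server to offer.
            // The message text here is illustrative, not Geode's wording.
            throw new NoAvailableServersException(
                "Locator responded but no cache servers are available");
        }
        return response.server;
    }
}
```

The same two checks would apply in findReplacementServer() and findServersForQueue().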