Bill Burcham created GEODE-9808:
-----------------------------------

             Summary: Client ops fail with NoAvailableLocatorsException when 
all servers leave the DS 
                 Key: GEODE-9808
                 URL: https://issues.apache.org/jira/browse/GEODE-9808
             Project: Geode
          Issue Type: Bug
          Components: client/server
    Affects Versions: 1.15.0
            Reporter: Bill Burcham


When there are no cache servers (only locators) in a cluster, client operations 
will fail with a misleading exception:
{noformat}
org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334]
    at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:174)
    at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:211)
    at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:196)
    at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:227)
    at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.exchangeConnection(ConnectionManagerImpl.java:365)
    at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:161)
    at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:120)
    at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:805)
    at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91)
{noformat}
Even though the client is able to connect to a locator, we encounter a 
NoAvailableLocatorsException with the message "Unable to connect to any 
locators in the list".

Investigating the product code, we see:
 # If there are no cache servers in the cluster, ServerLocator.pickServer() 
constructs a ClientConnectionResponse(null). That object's hasResult() returns 
false, so the loop termination test in 
AutoConnectionSourceImpl.queryLocators() does not accept it as an answer.

 # The exception wording is misleading not only in 
AutoConnectionSourceImpl.findServer() but also in at least two other calling 
locations in AutoConnectionSourceImpl: findReplacementServer() and 
findServersForQueue().

 # In each of those cases, the calling method translates a null response from 
queryLocators() into a throw of a NoAvailableLocatorsException.

 # An appropriate exception, NoAvailableServersException, already exists for 
the case where we were able to contact a locator but the locator was not able 
to find any cache servers.

 # According to my Git spelunking, queryLocators() has been obscuring the true 
cause of the failure since at least 2015.

Without analyzing ServerLocator.pickServer() 
(LocatorLoadSnapshot.getServerForConnection()) to discern why two locators 
might disagree on how many cache servers are in the cluster, it seems to me 
that we should modify AutoConnectionSourceImpl.queryLocators() so that:
 * if it gets a ServerLocationResponse with hasResult() true, it immediately 
returns that as it does now

 * otherwise it keeps trying the remaining locators and keeps track of the 
last non-null ServerLocationResponse it has received

 * if no locator returns a result, it returns the last non-null 
ServerLocationResponse it received, or null if no locator responded at all
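
A minimal sketch of the loop described above. It uses hypothetical stand-in types rather than Geode's real classes: Response only models ServerLocationResponse.hasResult(), and an askLocator function that returns null models a locator that could not be reached. The real queryLocators() signature and locator iteration are not reproduced here.

```java
import java.util.List;
import java.util.function.Function;

/** Illustrative only: stand-in types, not Geode's AutoConnectionSourceImpl. */
final class QueryLocatorsSketch {

  /** Hypothetical stand-in for ServerLocationResponse. */
  interface Response {
    boolean hasResult();
  }

  /**
   * Asks each locator in turn. Returns the first response that has a result;
   * otherwise the last non-null response seen; otherwise null.
   */
  static <L> Response queryLocators(List<L> locators, Function<L, Response> askLocator) {
    Response lastResponse = null;
    for (L locator : locators) {
      // Assumption: null means this locator could not be contacted.
      Response response = askLocator.apply(locator);
      if (response == null) {
        continue;
      }
      if (response.hasResult()) {
        return response; // same as today: stop at the first usable answer
      }
      lastResponse = response; // a locator answered, but it knows of no servers
    }
    return lastResponse;
  }
}
```

With this shape, a null return can only mean that no locator answered, while a non-null return with hasResult() false means at least one locator answered but found no cache servers.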

With that in hand, we can change the three call locations in 
AutoConnectionSourceImpl: findServer(), findReplacementServer(), and 
findServersForQueue() to each throw NoAvailableLocatorsException if no locator 
responded, or NoAvailableServersException if a locator responded with a 
ClientConnectionResponse for which hasResult() returns false.
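
As a rough illustration of that caller-side translation, here is a sketch with local stand-ins. The two exception classes below are simplified placeholders named after the Geode exceptions, requireServer() is a hypothetical helper that does not exist in Geode, and the wording of the NoAvailableServersException message is an assumption.

```java
/** Illustrative only: stand-in types, not Geode's AutoConnectionSourceImpl. */
final class FindServerSketch {

  /** Hypothetical stand-in for ServerLocationResponse. */
  interface Response {
    boolean hasResult();
  }

  /** Local placeholder for org.apache.geode.cache.client.NoAvailableLocatorsException. */
  static class NoAvailableLocatorsException extends RuntimeException {
    NoAvailableLocatorsException(String message) {
      super(message);
    }
  }

  /** Local placeholder for org.apache.geode.cache.client.NoAvailableServersException. */
  static class NoAvailableServersException extends RuntimeException {
    NoAvailableServersException(String message) {
      super(message);
    }
  }

  /** Maps the outcome of a locator query to a response or to the matching exception. */
  static Response requireServer(Response response, Object locators) {
    if (response == null) {
      // No locator answered at all.
      throw new NoAvailableLocatorsException(
          "Unable to connect to any locators in the list " + locators);
    }
    if (!response.hasResult()) {
      // A locator answered, but it could not offer any cache server.
      throw new NoAvailableServersException(
          "Locators in the list " + locators + " found no cache servers");
    }
    return response;
  }
}
```

Each of findServer(), findReplacementServer(), and findServersForQueue() could apply the same two checks to whatever queryLocators() returns.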



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
