Chris M. Hostetter created SOLR-14897:
-----------------------------------------

             Summary: HttpSolrCall will forward a virtually unlimited number of 
times until ClusterState ZkWatcher is updated after collection delete
                 Key: SOLR-14897
                 URL: https://issues.apache.org/jira/browse/SOLR-14897
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Chris M. Hostetter


While investigating the root cause of some SOLR-14896 related failures, I have 
seen evidence that if a collection is deleted, but a client makes a subsequent 
request for that collection _before_ the local ClusterState has been updated to 
remove that DocCollection, HttpSolrCall will forward/proxy that request a 
(virtually) unbounded number of times in a very short time period - stopping 
only once the "cached" local DocCollection is updated to indicate there are 
no active replicas.

While HttpSolrCall does track & increment a {{_forwardedCount}} param on every 
request it forwards, it doesn't consult that count unless/until it finds a 
situation where the (local) DocCollection says there are no active replicas.

So if you have a collection XX with 4 total replicas on 4 different nodes 
(A,B,C,D), and you delete XX (triggering sequential core deletions on A,B,C,D 
that fire successive ZkWatchers on various nodes to update the collection 
state), a request for XX can bounce back and forth between nodes C & D 20+ 
times until the ClusterState watcher fires on both of those nodes, so they 
finally realize that the {{_forwardedCount=20}} is more than the 0 active 
replicas...
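
As a toy illustration of the scenario above (not Solr code), the bouncing can 
be modelled as a loop that keeps forwarding while the cached state is stale. 
The class name, method name and the {{hopsUntilWatchersFire}} parameter are 
all hypothetical; the parameter is just a stand-in for how many forwards 
happen before the watchers fire on both nodes.

```java
// Toy model only: none of these names exist in Solr.
class ForwardLoopSketch {

    /**
     * Counts how often a request is forwarded between two nodes whose cached
     * DocCollection still claims the other node has an active replica.
     *
     * @param hopsUntilWatchersFire hypothetical number of forwards that fit
     *        into the window before the ClusterState watchers have fired
     */
    static int countForwards(int hopsUntilWatchersFire) {
        int forwardedCount = 0;
        while (true) {
            // Stale cache: the local state still "thinks" a replica is active elsewhere.
            boolean cacheSaysActive = forwardedCount < hopsUntilWatchersFire;
            if (!cacheSaysActive) {
                // Only at this point would the count be compared against the
                // total number of replicas, so it never limited the loop.
                return forwardedCount;
            }
            forwardedCount++; // forward to the other node, which runs the same logic
        }
    }

    public static void main(String[] args) {
        int totalReplicas = 4;
        int forwards = countForwards(20);
        System.out.println("forwards=" + forwards
            + " exceedsTotalReplicas=" + (forwards > totalReplicas));
    }
}
```

In this model the number of forwards depends only on how long the cache stays 
stale, never on the replica count.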

In the below code snippet from HttpSolrCall, the first call to 
{{getCoreUrl(...)}} is expected to return null if there are no active replicas 
- but it uses the local cached DocCollection, which may _think_ there is an 
active replica on another node, so it forwards the request to that node - where 
the replica may have been deleted, so that node runs the same code and may 
forward the request right back to the original node....
{code:java}
    String coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
        activeSlices, byCoreName, true);

    // Avoid getting into a recursive loop of requests being forwarded by
    // stopping forwarding and erroring out after (totalReplicas) forwards
    if (coreUrl == null) {
      if (queryParams.getInt(INTERNAL_REQUEST_COUNT, 0) > totalReplicas){
        throw new SolrException(SolrException.ErrorCode.INVALID_STATE,
            "No active replicas found for collection: " + collectionName);
      }
      coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
          activeSlices, byCoreName, false);
    }
{code}
...the check that is supposed to prevent a "recursive loop" is only consulted 
once a situation arises where the local ClusterState indicates there are no 
active replicas - which seems to defeat the point of the forward check?  (At 
that point, if the total number of replicas hasn't been exceeded, the code is 
happy to forward the request to a coreUrl which the local ClusterState 
indicates is _not_ active, which also seems to defeat the point?)
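
For contrast, here is a minimal sketch of what consulting the forward count 
_before_ choosing any target might look like. This is an assumption about one 
possible ordering, not a patch and not the actual HttpSolrCall API: 
{{ForwardGuardSketch}}, {{pickForwardUrl}} and its parameters are hypothetical 
stand-ins for the two {{getCoreUrl(...)}} results and the request param.

```java
// Hypothetical sketch: these names are illustrative and do not exist in Solr.
class ForwardGuardSketch {

    /**
     * @param activeUrl      URL of a replica the local (possibly stale) state
     *                       believes is active, or null if there is none
     * @param anyUrl         URL of any replica, active or not, or null
     * @param forwardedCount how many times this request was already forwarded
     * @param totalReplicas  total replicas according to the local state
     */
    static String pickForwardUrl(String activeUrl, String anyUrl,
                                 int forwardedCount, int totalReplicas) {
        // Check the count first, regardless of what the cached state claims,
        // so a stale "active" replica cannot keep the request bouncing.
        if (forwardedCount > totalReplicas) {
            throw new IllegalStateException(
                "Request already forwarded " + forwardedCount + " times");
        }
        // Otherwise prefer a replica that is believed to be active.
        return activeUrl != null ? activeUrl : anyUrl;
    }

    public static void main(String[] args) {
        System.out.println(pickForwardUrl("http://nodeC/XX", "http://nodeD/XX", 2, 4));
    }
}
```

With 4 total replicas, a request in this sketch is rejected once it has been 
forwarded 5 times, even if the cached state still lists an active replica.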

 


