[ 
https://issues.apache.org/jira/browse/SOLR-14897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-14897:
--------------------------------
    Attachment: SOLR-14897.patch

> HttpSolrCall will forward a virtually unlimited number of times until 
> ClusterState ZkWatcher is updated after collection delete
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-14897
>                 URL: https://issues.apache.org/jira/browse/SOLR-14897
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-14897.patch
>
>
> While investigating the root cause of some SOLR-14896 related failures, I 
> have seen evidence that if a collection is deleted, but a client makes a 
> subsequent request for that collection _before_ the local ClusterState has 
> been updated to remove that DocCollection, HttpSolrCall will forward/proxy 
> that request a (virtually) unbounded number of times in a very short time 
> period - stopping only once the "cached" local DocCollection is updated 
> to indicate there are no active replicas.
> While HttpSolrCall does track & increment a {{_forwardedCount}} param on 
> every request it forwards, it doesn't consult that param unless/until it 
> finds a situation where the (local) DocCollection says there are no active 
> replicas.
> So if you have a collection XX with 4 total replicas on 4 diff nodes 
> (A,B,C,D), and you delete XX (triggering sequential core deletions on 
> A,B,C,D that fire successive ZkWatchers on various nodes to update the 
> collection state) a request for XX can bounce back and forth between nodes C 
> & D 20+ times until the ClusterState watcher fires on both of those nodes so 
> they finally realize that the {{_forwardedCount=20}} is more than the 0 active 
> replicas...
> In the below code snippet from HttpSolrCall, the first call to 
> {{getCoreUrl(...)}} is expected to return null if there are no active 
> replicas - but it uses the local cached DocCollection, which may _think_ 
> there is an active replica on another node, so it forwards the request to 
> that node - where the replica may have been deleted, so that node runs the 
> same code and may forward the request right back to the original node....
> {code:java}
>     String coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>         activeSlices, byCoreName, true);
>     // Avoid getting into a recursive loop of requests being forwarded by
>     // stopping forwarding and erroring out after (totalReplicas) forwards
>     if (coreUrl == null) {
>       if (queryParams.getInt(INTERNAL_REQUEST_COUNT, 0) > totalReplicas){
>         throw new SolrException(SolrException.ErrorCode.INVALID_STATE,
>             "No active replicas found for collection: " + collectionName);
>       }
>       coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>           activeSlices, byCoreName, false);
>     }
> {code}
> ..the check that is supposed to prevent a "recursive loop" is only consulted 
> once a situation arises where local ClusterState indicates there are no 
> active replicas - which seems to defeat the point of the forward check?  (at 
> which point if the total number of replicas hasn't been exceeded, the code is 
> happy to forward the request to a coreUrl which the local ClusterState 
> indicates is _not_ active (which also seems to defeat the point?))
>  
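
The ping-pong described above can be illustrated with a small, self-contained toy model. This is only a sketch and not Solr code: the class {{ForwardLoopSketch}}, the method {{countForwards}}, and the {{watcherDelay}} parameter are hypothetical names invented for illustration, and the model assumes just two nodes (think C and D) that each keep believing the other hosts an active replica until their simulated ZkWatcher fires. The {{checkCountFirst}} flag models one possible alternative ordering (consulting the forward count before trusting the cached state); it is an assumption, not the fix in the attached patch.

```java
/** Toy model of the forwarding ping-pong between two nodes; not Solr code. */
public class ForwardLoopSketch {

  /**
   * Simulates a single request bouncing between two nodes whose cached
   * cluster state is stale.
   *
   * @param totalReplicas   replica count used as the forwarding limit
   * @param watcherDelay    (hypothetical) number of requests each node handles
   *                        before its simulated ZkWatcher refreshes the cache
   * @param checkCountFirst (hypothetical) if true, the forward count is
   *                        consulted before the cached state is trusted
   * @return how many times the request was forwarded before being rejected
   */
  public static int countForwards(int totalReplicas, int watcherDelay,
                                  boolean checkCountFirst) {
    int[] handled = new int[2]; // requests handled so far by each node
    int forwardedCount = 0;     // plays the role of the _forwardedCount param
    int current = 0;            // index of the node handling the request

    while (true) {
      // The node's cache is stale until it has handled watcherDelay requests.
      boolean stale = handled[current] < watcherDelay;
      handled[current]++;

      // Assumed alternative: enforce the limit before looking at cached state.
      if (checkCountFirst && forwardedCount > totalReplicas) {
        return forwardedCount;
      }

      // Behavior described in the report: the request only stops once the
      // local state finally shows that no active replica exists.
      if (!stale) {
        return forwardedCount;
      }

      // Stale cache claims the peer has an active replica, so forward to it.
      forwardedCount++;
      current = 1 - current;
    }
  }

  public static void main(String[] args) {
    System.out.println("count consulted only without active replicas: "
        + countForwards(4, 10, false));
    System.out.println("count consulted up front (hypothetical): "
        + countForwards(4, 10, true));
  }
}
```

With these toy parameters (4 replicas, each watcher firing after 10 handled requests), the first call reports 20 forwards and the second reports 5, mirroring the gap between the observed behavior and a limit that is enforced on every hop.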



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
