[jira] [Commented] (SOLR-14897) HttpSolrCall will forward a virtually unlimited number of times until ClusterState ZkWatcher is updated after collection delete

Chris M. Hostetter (Jira) Fri, 25 Sep 2020 18:02:43 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-14897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202474#comment-17202474
 ]


Chris M. Hostetter commented on SOLR-14897:
-------------------------------------------

This excessive forwarding, combined with SOLR-14898 causing response header 
duplication on _every_ forward, is what leads to the "Response header too 
large" error situation that causes SOLR-14896.

Here's what jetty's {{HttpChannel COMMIT}} debug logging looks like for the 
{{_forwardedCount=23}} request:
{noformat}
 2020-09-25 06:02:59.377 DEBUG (qtp1800649922-16) [   ] o.e.j.s.HttpChannel COMMIT for /solr/sigtest-c7cfa75ce_recs_aggr/select on HttpChannelOverHttp@2f4e148{s=HttpChannelState@5f1bd187{s=HANDLING rs=BLOCKING os=COMMITTED is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=23,age=60}
404 null HTTP/1.1
Content-Security-Policy: default-src 'none'; base-uri 'none'; connect-src 'self'; form-action 'self'; font-src 'self'; frame-ancestors 'none'; img-src 'self'; media-src 'self'; style-src 'self' 'unsafe-inline'; script-src 'self'; worker-src 'self';
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Content-Security-Policy: default-src 'none'; base-uri 'none'; connect-src 'self'; form-action 'self'; font-src 'self'; frame-ancestors 'none'; img-src 'self'; media-src 'self'; style-src 'self' 'unsafe-inline'; script-src 'self'; worker-src 'self';
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Content-Security-Policy: default-src 'none'; base-uri 'none'; connect-src 'self'; form-action 'self'; font-src 'self'; frame-ancestors 'none'; img-src 'self'; media-src 'self'; style-src 'self' 'unsafe-inline'; script-src 'self'; worker-src 'self';
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Content-Security-Policy: default-src 'none'; base-uri 'none'; connect-src 'self'; form-action 'self'; font-src 'self'; frame-ancestors 'none'; img-src 'self'; media-src 'self'; style-src 'self' 'unsafe-inline'; script-src 'self'; worker-src 'self';
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Cache-Control: must-revalidate,no-cache,no-store
Content-Type: text/html;charset=iso-8859-1
Content-Length: 397
{noformat}
...as we keep popping off the stack of requests (decreasing the forward 
count), the response headers just get longer and longer until the responses 
stop being 404 and become 500 because jetty has decided the response headers 
are too large to send.
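
To put rough numbers on that growth, here is a minimal sketch (not code from Solr or jetty) that sums the size of the header group repeated in the log above and estimates how many copies it takes to exceed a size budget. It assumes one extra copy of the group per forward, which is how SOLR-14898 is described here, and the 8 KiB budget is an assumed placeholder; the real limit depends on how jetty is configured. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

class HeaderGrowthSketch {
    // The header group that appears repeatedly in the log above.
    static final List<String> REPEATED_HEADERS = List.of(
        "Content-Security-Policy: default-src 'none'; base-uri 'none'; connect-src 'self'; "
            + "form-action 'self'; font-src 'self'; frame-ancestors 'none'; img-src 'self'; "
            + "media-src 'self'; style-src 'self' 'unsafe-inline'; script-src 'self'; "
            + "worker-src 'self';",
        "X-Content-Type-Options: nosniff",
        "X-Frame-Options: SAMEORIGIN",
        "X-XSS-Protection: 1; mode=block");

    // ASSUMPTION: placeholder budget, not a value taken from the issue.
    static final int ASSUMED_LIMIT_BYTES = 8 * 1024;

    /** Bytes used by one copy of the repeated group, counting CRLF per line. */
    static int bytesPerCopy() {
        int total = 0;
        for (String header : REPEATED_HEADERS) {
            total += header.getBytes(StandardCharsets.ISO_8859_1).length + 2;
        }
        return total;
    }

    /** Bytes used by the given number of copies of the repeated group. */
    static int bytesAfter(int copies) {
        return copies * bytesPerCopy();
    }

    /** Smallest number of copies whose combined size exceeds the limit. */
    static int copiesToExceed(int limitBytes) {
        return limitBytes / bytesPerCopy() + 1;
    }

    public static void main(String[] args) {
        System.out.println("bytes per copy: " + bytesPerCopy());
        System.out.println("copies to exceed assumed limit: "
            + copiesToExceed(ASSUMED_LIMIT_BYTES));
    }
}
```

The growth is linear in the number of copies, so any fixed header budget is eventually exceeded once the request has been forwarded enough times.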

 

 

> HttpSolrCall will forward a virtually unlimited number of times until 
> ClusterState ZkWatcher is updated after collection delete
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-14897
>                 URL: https://issues.apache.org/jira/browse/SOLR-14897
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> While investigating the root cause of some SOLR-14896 related failures, I 
> have seen evidence that if a collection is deleted, but a client makes a 
> subsequent request for that collection _before_ the local ClusterState has 
> been updated to remove that DocCollection, HttpSolrCall will forward/proxy 
> that request a (virtually) unbounded number of times in a very short time 
> period - stopping only once the "cached" local DocCollection is updated 
> to indicate there are no active replicas.
> While HttpSolrCall does track & increment a {{_forwardedCount}} param on 
> every request it forwards, it doesn't consult that param unless/until it 
> finds a situation where the (local) DocCollection says there are no active 
> replicas.
> So if you have a collection XX with 4 total replicas on 4 different nodes 
> (A,B,C,D), and you delete XX (triggering sequential core deletions on 
> A,B,C,D that fire successive ZkWatchers on various nodes to update the 
> collection state) a request for XX can bounce back and forth between nodes C 
> & D 20+ times until the ClusterState watcher fires on both of those nodes so 
> they finally realize that the {{_forwardedCount=20}} is more than the 0 active 
> replicas...
> In the code snippet below from HttpSolrCall, the first call to 
> {{getCoreUrl(...)}} is expected to return null if there are no active 
> replicas - but it uses the local cached DocCollection, which may _think_ 
> there is an active replica on another node, so it forwards the request to 
> that node - where the replica may have been deleted, so that node runs the 
> same code and may forward the request right back to the original node....
> {code:java}
>     String coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>         activeSlices, byCoreName, true);
>     // Avoid getting into a recursive loop of requests being forwarded by
>     // stopping forwarding and erroring out after (totalReplicas) forwards
>     if (coreUrl == null) {
>       if (queryParams.getInt(INTERNAL_REQUEST_COUNT, 0) > totalReplicas){
>         throw new SolrException(SolrException.ErrorCode.INVALID_STATE,
>             "No active replicas found for collection: " + collectionName);
>       }
>       coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>           activeSlices, byCoreName, false);
>     }
> {code}
> ..the check that is supposed to prevent a "recursive loop" is only consulted 
> once a situation arises where local ClusterState indicates there are no 
> active replicas - which seems to defeat the point of the forward check?  At 
> which point, if the total number of replicas hasn't been exceeded, the code is 
> happy to forward the request to a coreUrl which the local ClusterState 
> indicates is _not_ active (which also seems to defeat the point?)
>  
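
To illustrate the ordering problem described in the issue, here is a small self-contained simulation (not HttpSolrCall code, and not a proposed patch). It assumes two nodes whose stale cached state each points at the other, as in the C & D example, and compares consulting the forward count before every forward against never reaching the check. {{ForwardLoopSketch}}, {{simulate}}, {{staleView}} and {{safetyCap}} are hypothetical names; the cap only stands in for "the watcher eventually fires".

```java
import java.util.Map;

class ForwardLoopSketch {
    /**
     * Counts how many times a request is forwarded between nodes.
     *
     * staleView maps each node to the node it still believes hosts an
     * active replica (null/missing means "no active replica known").
     * When checkFirst is true, the forward count is compared against
     * totalReplicas before every forward; when false, the count is never
     * consulted while the stale view keeps returning a target.
     */
    static int simulate(Map<String, String> staleView, String startNode,
                        int totalReplicas, boolean checkFirst, int safetyCap) {
        String node = startNode;
        int forwardedCount = 0;
        while (forwardedCount < safetyCap) {
            if (checkFirst && forwardedCount > totalReplicas) {
                break; // reject: forwarded more times than there are replicas
            }
            String target = staleView.get(node);
            if (target == null) {
                break; // local state finally shows no active replica
            }
            node = target; // forward to the node the stale view points at
            forwardedCount++;
        }
        return forwardedCount;
    }

    public static void main(String[] args) {
        // Nodes C and D each still think the other hosts an active replica.
        Map<String, String> stale = Map.of("C", "D", "D", "C");
        System.out.println("count checked first: "
            + simulate(stale, "C", 4, true, 1000));
        System.out.println("count never reached: "
            + simulate(stale, "C", 4, false, 1000));
    }
}
```

With 4 total replicas, the checked-first variant stops after 5 forwards, while the other variant keeps bouncing until the safety cap, mirroring the 20+ bounces described in the issue.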



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
