[ https://issues.apache.org/jira/browse/SOLR-14897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-14897:
--------------------------------------
    Fix Version/s: 8.6.3
         Priority: Blocker  (was: Major)

[~munendrasn] - +1 to committing your patch.

bq. ... I'm not able figure out way to add test for this change, any help would be appreciated

I'm not sure we have any good template/plumbing/helpers for testing this kind of situation ... I have some thoughts on how we might go about it (from an offline idea proposed by AB) that I'll put into a new Jira, but I don't think we should let building new test scaffolding for situations like this slow us down in trying to fix this really heinous bug ASAP.

> HttpSolrCall will forward a virtually unlimited number of times until
> ClusterState ZkWatcher is updated after collection delete
> ----------------------------------------------------------------------
>
>                 Key: SOLR-14897
>                 URL: https://issues.apache.org/jira/browse/SOLR-14897
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Chris M. Hostetter
>            Priority: Blocker
>             Fix For: 8.6.3
>
>         Attachments: SOLR-14897.patch
>
>
> While investigating the root cause of some SOLR-14896 related failures, I
> have seen evidence that if a collection is deleted, but a client makes a
> subsequent request for that collection _before_ the local ClusterState has
> been updated to remove that DocCollection, HttpSolrCall will forward/proxy
> that request a (virtually) unbounded number of times in a very short time
> period - stopping only once the "cached" local DocCollection is updated
> to indicate there are no active replicas.
> While HttpSolrCall does track & increment a {{_forwardedCount}} param on
> every request it forwards, it doesn't consult that count unless/until it
> finds a situation where the (local) DocCollection says there are no active
> replicas.
> So if you have a collection XX with 4 total replicas on 4 diff nodes
> (A,B,C,D), and you delete XX (triggering sequential core deletions on
> A,B,C,D that fire successive ZkWatchers on various nodes to update the
> collection state) a request for XX can bounce back and forth between nodes C
> & D 20+ times until the ClusterState watcher fires on both of those nodes so
> they finally realize that the {{_forwardedCount=20}} is more than the 0 active
> replicas...
> In the below code snippet from HttpSolrCall, the first call to
> {{getCoreUrl(...)}} is expected to return null if there are no active
> replicas - but it uses the local cached DocCollection, which may _think_
> there is an active replica on another node, so it forwards the request to
> that node - where the replica may have been deleted, so that node runs the
> same code and may forward the request right back to the original node....
> {code:java}
> String coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>     activeSlices, byCoreName, true);
> // Avoid getting into a recursive loop of requests being forwarded by
> // stopping forwarding and erroring out after (totalReplicas) forwards
> if (coreUrl == null) {
>   if (queryParams.getInt(INTERNAL_REQUEST_COUNT, 0) > totalReplicas){
>     throw new SolrException(SolrException.ErrorCode.INVALID_STATE,
>         "No active replicas found for collection: " + collectionName);
>   }
>   coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>       activeSlices, byCoreName, false);
> }
> {code}
> ..the check that is supposed to prevent a "recursive loop" is only consulted
> once a situation arises where local ClusterState indicates there are no
> active replicas - which seems to defeat the point of the forward check? (at
> which point if the total number of replicas hasn't been exceeded, the code is
> happy to forward the request to a coreUrl which the local ClusterState
> indicates is _not_ active (which also seems to defeat the point?))
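
The effect of where the {{_forwardedCount}} check sits can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions: {{ForwardLoopSketch}}, {{countForwards}} and the {{staleHops}} parameter are hypothetical names made up for illustration, none of this is Solr code, and the "check before every forward" variant is just one possible ordering, not necessarily what the attached patch does.

```java
public class ForwardLoopSketch {

    /**
     * Counts how many times a request is forwarded before a node gives up.
     * Hypothetical model, not Solr's API: every node's cached state stays
     * stale (still "sees" an active replica elsewhere) for the first
     * staleHops forwards, then reports no active replicas.
     */
    static int countForwards(boolean checkCountBeforeForwarding,
                             int totalReplicas, int staleHops) {
        int forwardedCount = 0; // plays the role of the _forwardedCount param
        while (true) {
            boolean cacheSeesActiveReplica = forwardedCount < staleHops;
            boolean limitExceeded = forwardedCount > totalReplicas;

            // Assumed alternative ordering: consult the count on every request.
            if (checkCountBeforeForwarding && limitExceeded) {
                return forwardedCount;
            }
            // Ordering described in the issue: the count is only consulted
            // once the local cache reports no active replicas.
            if (!cacheSeesActiveReplica && limitExceeded) {
                return forwardedCount;
            }
            // Otherwise the request is forwarded to another node.
            forwardedCount++;
        }
    }

    public static void main(String[] args) {
        // 4 replicas, caches stay stale for 20 forwards
        System.out.println(countForwards(false, 4, 20)); // prints 20
        System.out.println(countForwards(true, 4, 20));  // prints 5
    }
}
```

In this model, with 4 replicas and caches that stay stale for 20 hops, the ordering described in the issue forwards the request 20 times, while consulting the count first stops after 5 forwards.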