Which version of Solr are you using? Regardless of your environment, this is a fail-safe that you should not be hitting.

- Mark
> On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen <h...@entertainment-trading.com> wrote:
>
> I previously made a post on this, but have since narrowed down the issue and am now giving this another try, with another spin to it.
>
> We are running a 4-node setup (over Tomcat 7) with a 3-node external ZooKeeper ensemble. This runs on a total of 7 (4+3) different VMs, and each VM uses our storage system (an NFS share in VMware).
>
> Now, I do realize and have heard that NFS is not the greatest thing to run Solr on, but we have never had this issue on non-SolrCloud setups.
>
> Basically, each night when we run our backup jobs, our storage becomes a bit slow to respond. That is obviously something we're trying to solve, but the bottom line is that all our other systems stay alive or recover gracefully once bandwidth is available again.
>
> SolrCloud, not so much. Typically after a session like this, 3-5 nodes will go into either a Down or a Recovering state, and stay that way. Sometimes such a node will even be marked as leader. Such a node will have something like this in its log:
>
> ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
> ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
>     at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
>     at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>     at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>     at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>     at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
>     at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
>     at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:724)
>
> On the other nodes, an error similar to this will be in the log:
>
> 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
> 09:27:34 - ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
>
> Does anyone have any ideas or leads towards a solution - one that doesn't involve getting a new storage system (a solution we *are* actively working on, but that's not a quick fix in our case)? Shouldn't a setup like this be possible? And even more so, shouldn't SolrCloud be able to recover gracefully after issues like this?
>
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
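A good first step when this happens is to compare what ZooKeeper records against what the node believes locally. A minimal sketch, assuming a standard Solr 4.x install (zkcli.sh ships under example/cloud-scripts/) and hypothetical ZooKeeper host names:

    # Dump the cluster state as ZooKeeper sees it; check which replica
    # is recorded as leader for products_fi_shard1.
    ./zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd get /clusterstate.json

    # List znodes; /live_nodes shows which Solr nodes still hold a live
    # ZooKeeper session.
    ./zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd list

If the leader recorded in clusterstate.json is a node that stalled during the backup window, the "ClusterState says we are the leader" error is the defensive check in DistributedUpdateProcessor refusing to index while the cluster state and the node's local leader state disagree.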
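The usual trigger for this pattern is the ZooKeeper session expiring while the Solr node is blocked on slow I/O: the node's ephemeral znodes drop, leadership moves, and the node comes back with a view that no longer matches the cluster state. If the storage is only slow during the backup window, raising the client timeout may buy enough headroom. A sketch, assuming the stock Solr 4.x solr.xml (which reads zkClientTimeout from a system property, default 15000 ms) and Tomcat's setenv.sh:

    # bin/setenv.sh (hypothetical location) - allow 30s of unresponsiveness
    # before the ZooKeeper session expires. The ensemble caps sessions at
    # maxSessionTimeout (20 * tickTime by default, i.e. 40s), so raise that
    # too if you go higher than this.
    JAVA_OPTS="$JAVA_OPTS -DzkClientTimeout=30000"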
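For replicas that stay stuck in Down or Recovering after the storage recovers, it is sometimes possible to nudge them without restarting Tomcat. A sketch using the CoreAdmin REQUESTRECOVERY action (present in Solr 4.x) and the core name from the log above:

    # Ask the stuck core to re-enter recovery against its current leader.
    curl 'http://solr04.cd-et.com:8080/solr/admin/cores?action=REQUESTRECOVERY&core=products_fi_shard1_replica2'

If a node is wrongly marked as leader, restarting that node is usually still the safer way to force a fresh leader election.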