I previously posted about this, but I have since narrowed down the issue and am giving it another try from a different angle.

We are running a 4-node SolrCloud setup (on Tomcat 7) with an external 3-node ZooKeeper ensemble. This runs on a total of 7 (4+3) different VMs, and each VM uses our storage system (an NFS share in VMware).

I realize that NFS is not the best storage to run Solr on, but we have never had this issue on non-SolrCloud setups.

Basically, each night when we run our backup jobs, our storage becomes a bit slow to respond. This is obviously something we’re trying to solve, but the bottom line is that all our other systems somehow stay alive or recover gracefully once bandwidth is available again.

SolrCloud - not so much. Typically after a session like this, 3-5 nodes will go into either a Down state or a Recovering state - and stay that way. Sometimes such a node will even be marked as leader. Such a node will have something like this in its log:

ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

On the other nodes, an error similar to this will be in the log:

09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
09:27:34 -ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...

Does anyone have any ideas or leads towards a solution - one that doesn’t involve getting a new storage system (a solution we *are* actively working on, but that’s not a quick fix in our case)?

Shouldn’t a setup like this be possible? And even more so - shouldn’t SolrCloud be able to gracefully recover after issues like this?

--
Henrik Ossipoff Hansen
Developer, Entertainment Trading