Which version of Solr are you using? Regardless of your environment, this is a 
fail-safe that you should not hit. 
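
For reference, that error is thrown by doDefensiveChecks() in 
DistributedUpdateProcessor: when a forwarded update arrives, the node compares 
the leader published in ZooKeeper's cluster state with its own local election 
state and rejects the update if the two disagree. A much-simplified sketch of 
the idea (not the actual Solr source):

public class LeaderCheckSketch {

    // Toy illustration of the safety valve behind "ClusterState says we are
    // the leader, but locally we don't think so". Solr's real check lives in
    // DistributedUpdateProcessor.doDefensiveChecks().
    static void doDefensiveCheck(String clusterStateLeaderUrl,
                                 String myCoreUrl,
                                 boolean locallyElectedLeader) {
        boolean clusterStateSaysLeader = clusterStateLeaderUrl.equals(myCoreUrl);
        if (clusterStateSaysLeader && !locallyElectedLeader) {
            throw new IllegalStateException(
                    "ClusterState says we are the leader (" + myCoreUrl
                            + "), but locally we don't think so");
        }
    }

    public static void main(String[] args) {
        // During a long I/O or ZooKeeper stall the local election state can
        // lag behind the published cluster state; that is when this trips.
        doDefensiveCheck(
                "http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2",
                "http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2",
                false);
    }
}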

- Mark

> On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen 
> <h...@entertainment-trading.com> wrote:
> 
> I previously made a post about this, but have since narrowed down the issue 
> and am now giving it another try, with a slightly different spin.
> 
> We are running a 4-node setup (on Tomcat 7) with an external 3-node ZooKeeper 
> ensemble. This runs on a total of 7 (4+3) different VMs, and each VM uses our 
> storage system (an NFS share in VMware).
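> 
> For reference, both clients and the nodes themselves find the cluster through 
> the ZooKeeper ensemble rather than through any single Solr host. A minimal 
> SolrJ sketch of that wiring (the zk hostnames below are placeholders, not our 
> real ones) would look like this:
> 
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.impl.CloudSolrServer;
> 
> public class CloudWiringSketch {
>     public static void main(String[] args) throws Exception {
>         // Only the ZooKeeper ensemble is given; the four Solr nodes are
>         // discovered from the cluster state stored there.
>         CloudSolrServer server =
>                 new CloudSolrServer("zk01:2181,zk02:2181,zk03:2181");
>         server.setDefaultCollection("products_dk");
>         server.connect(); // reads the live cluster state from ZooKeeper
>         long found = server.query(new SolrQuery("*:*")).getResults().getNumFound();
>         System.out.println("Documents visible via the cloud client: " + found);
>         server.shutdown();
>     }
> }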
> 
> Now I do realize, and have heard, that NFS is not the greatest thing to run 
> Solr on, but we have never had this issue with non-SolrCloud setups.
> 
> Basically, each night when we run our backup jobs, our storage becomes a bit 
> slow to respond - this is obviously something we’re trying to solve, but the 
> bottom line is that all our other systems somehow stay alive or recover 
> gracefully once bandwidth is available again.
> SolrCloud - not so much. Typically after a session like this, 3-5 nodes will 
> either go into a Down state or a Recovering state - and stay that way. 
> Sometimes such a node will even be marked as leader. Such a node will have 
> something like this in the log:
> 
> ERROR - 2013-11-05 08:57:45.764; 
> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState 
> says we are the leader, but locally we don't think so
> ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; 
> org.apache.solr.common.SolrException: ClusterState says we are the leader 
> (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally 
> we don't think so. Request came from 
> http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
>        at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
>        at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
>        at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
>        at 
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
>        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>        at 
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>        at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>        at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>        at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>        at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>        at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>        at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
>        at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>        at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>        at 
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>        at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>        at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>        at 
> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
>        at 
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
>        at 
> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
>        at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        at java.lang.Thread.run(Thread.java:724)
> 
> On the other nodes, an error similar to this will be in the log:
> 
> 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: 
> http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>  Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 
> returned non ok status:503, message:Service Unavailable
> 09:27:34 - ERROR - SolrCmdDistributor forwarding update to 
> http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - 
> retrying ...
> 
> Does anyone have any ideas or leads towards a solution - one that doesn’t 
> involve getting a new storage system? (That is a solution we *are* actively 
> working on, but it’s not a quick fix in our case.) Shouldn’t a setup like 
> this be possible? And even more so - shouldn’t SolrCloud be able to recover 
> gracefully after issues like this?
> 
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
