SolrCloud never fully recovers after slow disks

Henrik Ossipoff Hansen Tue, 05 Nov 2013 00:33:09 -0800

I previously made a post on this, but have since narrowed down the issue and am 
now giving this another try, with another spin to it.


We are running a 4-node setup (on Tomcat 7) with an external 3-node ZooKeeper 
ensemble. This runs on a total of 7 (4+3) different VMs, and each VM uses our 
storage system (an NFS share in VMware).

Now, I do realize, and have heard, that NFS is not the best storage to run 
Solr on, but we have never had this issue on non-SolrCloud setups.

Basically, each night when we run our backup jobs, our storage becomes a bit 
slow to respond - this is obviously something we’re trying to solve, but the 
bottom line is that all our other systems somehow stay alive or recover 
gracefully when bandwidth is available again.
SolrCloud - not so much. Typically after a session like this, 3-5 nodes will 
either go into a Down state or a Recovering state - and stay that way. 
Sometimes such a node will even be marked as leader. Such a node will have 
something like this in the log:

ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

On the other nodes, an error similar to this will be in the log:

09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
09:27:34 - ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...

Does anyone have any ideas or leads towards a solution - one that doesn’t 
involve getting a new storage system (a solution we *are* actively working on, 
but that’s not a quick fix in our case)? Shouldn’t a setup like this be 
possible? And even more so - shouldn’t SolrCloud be able to gracefully recover 
after issues like this?
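
For anyone who wants to check for the same symptom, here is a minimal sketch 
of how replicas stuck in a Down or Recovering state could be listed from a 
copy of the cluster state. It assumes the Solr 4.x clusterstate.json layout 
(collection -> shards -> replicas, each replica carrying "state" and an 
optional "leader" flag). The sample data, the node names and the function name 
find_unhealthy_replicas are made up for illustration and are not taken from 
our actual cluster.

```python
import json

# Hypothetical excerpt shaped like a Solr 4.x clusterstate.json.
# All node names and states here are illustrative assumptions.
SAMPLE_CLUSTERSTATE = """
{
  "products_fi": {
    "shards": {
      "shard1": {
        "replicas": {
          "core_node1": {"state": "active",
                         "core": "products_fi_shard1_replica1",
                         "node_name": "node-a.example:8080_solr",
                         "leader": "true"},
          "core_node2": {"state": "recovering",
                         "core": "products_fi_shard1_replica2",
                         "node_name": "node-b.example:8080_solr"}
        }
      },
      "shard2": {
        "replicas": {
          "core_node3": {"state": "down",
                         "core": "products_fi_shard2_replica1",
                         "node_name": "node-c.example:8080_solr",
                         "leader": "true"},
          "core_node4": {"state": "active",
                         "core": "products_fi_shard2_replica2",
                         "node_name": "node-d.example:8080_solr"}
        }
      }
    }
  }
}
"""


def find_unhealthy_replicas(clusterstate):
    """Return (collection, shard, core, state, is_leader) for every
    replica whose recorded state is anything other than "active"."""
    problems = []
    for collection, coll_data in sorted(clusterstate.items()):
        shards = coll_data.get("shards", {})
        for shard, shard_data in sorted(shards.items()):
            replicas = shard_data.get("replicas", {})
            for _, replica in sorted(replicas.items()):
                state = replica.get("state")
                if state != "active":
                    is_leader = replica.get("leader") == "true"
                    problems.append(
                        (collection, shard, replica.get("core"), state, is_leader)
                    )
    return problems


if __name__ == "__main__":
    state = json.loads(SAMPLE_CLUSTERSTATE)
    for collection, shard, core, status, is_leader in find_unhealthy_replicas(state):
        note = " (marked as leader!)" if is_leader else ""
        print(f"{collection}/{shard}: {core} is {status}{note}")
```

A replica that is reported as non-active while still flagged as leader is the 
case that matches the "ClusterState says we are the leader" error above. In a 
real check the JSON would be read from ZooKeeper rather than from an embedded 
string.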

--
Henrik Ossipoff Hansen
Developer, Entertainment Trading
