I figured out that most of the startup time seems to be spent waiting for replicas to recover. It waits anywhere from 6 seconds up to 600 seconds for replicas to recover before trying again; sometimes the retry succeeds, and otherwise it marks the core as down. Is there a way to reduce the timeout during recovery? Also, can anyone explain why the recovery takes so long? Can't it mark itself as the leader and not wait for some replica to be available?
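
For what it's worth, the "Wait 128.0 seconds ... (7)" line in the logs below looks consistent with a wait that doubles on every retry and is capped at 600 seconds, since 2^7 = 128. That is only my inference from the log output, not something I have confirmed in the Solr source. A minimal sketch of that assumed schedule (the class name, method name, and constant are hypothetical, not Solr's):

```java
// Sketch of an ASSUMED retry schedule inferred from the log output.
// None of these names come from Solr; they are illustrative only.
class RecoveryBackoffSketch {
    // Assumed cap, taken from the 600-second maximum wait I observed.
    static final double MAX_WAIT_SECONDS = 600.0;

    // Hypothetical helper: the wait doubles with each attempt, up to the cap.
    static double waitSeconds(int attempt) {
        return Math.min(Math.pow(2, attempt), MAX_WAIT_SECONDS);
    }

    public static void main(String[] args) {
        // Print the assumed wait for the first ten retry attempts.
        for (int attempt = 1; attempt <= 10; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                    + waitSeconds(attempt) + " s");
        }
    }
}
```

If that assumption holds, the first retries come back within seconds, attempt 7 waits 128 seconds, and from attempt 10 onward every wait is the full 600 seconds, which would explain why a node that misses its early retries takes so long to come up.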

*Logs*:

ERROR - 2014-03-22 19:34:07.852; org.apache.solr.common.SolrException; Error while trying to recover. core=testcollection_shard5_replica1:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait on state recovering for 10.1.1.100:8983_solr but I still do not see the requested state. I see state: active live:true
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:402)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:202)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:346)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
ERROR - 2014-03-22 19:34:07.853; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (6) core= testcollection_shard5_replica1
INFO - 2014-03-22 19:34:07.853; org.apache.solr.cloud.RecoveryStrategy; Wait 128.0 seconds before trying to recover again (7)

On Fri, Mar 21, 2014 at 1:05 PM, Chris W <chris1980....@gmail.com> wrote:

> Sorry for the piecemeal approach but had another question. I have a 3 zk
> ensemble. Does making 2 zk as observer roles help speed up bootup of solr
> (due to decrease in time it takes to decide leaders for shards)?
>
> On Fri, Mar 21, 2014 at 11:49 AM, Chris W <chris1980....@gmail.com> wrote:
>
>> Thanks Tim. I would definitely try that next time. I have seen a few
>> instances where the overseer_queue not getting processed but that looks
>> like an existing bug which got fixed in 4.6 (overseer doesnt process
>> requests when reload collection fails)
>>
>> One question: Assuming our cluster can tolerate downtime of about 10-15
>> minutes, is it ok to restart all solrnodes at the same time? or will there
>> be race conditions while recovery?
>>
>> On Fri, Mar 21, 2014 at 11:08 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> On March 21, 2014 at 1:46:13 PM, Tim Potter (tim.pot...@lucidworks.com)
>>> wrote:
>>>
>>> We've seen instances where you end up restarting the overseer node each
>>> time as you restart the cluster, which causes all kinds of craziness.
>>>
>>> That would be a great test to add to the suite.
>>>
>>> --
>>> Mark Miller
>>> about.me/markrmiller
>>
>> --
>> Best
>> --
>> C
>
> --
> Best
> --
> C

--
Best
--
C