I didn't see the ZK timeout you set (I just skimmed). But if your ZooKeeper was down _very_ temporarily, it may suffice to up the ZK timeout. The default in the 4.10.4 time-frame (if I remember correctly) was 15 seconds, which has proven to be too short in many circumstances.
Of course if your ZK was down for minutes this wouldn't help.

Best,
Erick

On Mon, Aug 10, 2015 at 1:06 PM, danny teichthal <dannyt...@gmail.com> wrote:
> Hi Alexander,
> Thanks for your reply, I looked at the release notes.
> There is one bug fix - SOLR-7503
> <https://issues.apache.org/jira/browse/SOLR-7503> – register cores
> asynchronously.
> It may reduce the registration time since it is done in parallel, but
> still, 3 minutes (leaderVoteWait) is a long time to recover from a few
> seconds of disconnection.
>
> Apart from that one I don't see any bug fix that addresses the same
> problem.
> I am able to reproduce it on 4.10.4 pretty easily, I will also try it with
> 5.2.1 and see if it reproduces.
>
> Anyway, since migrating to 5.2.1 is not an option for me in the short term,
> I'm left with the question of whether reducing leaderVoteWait may help
> here, and what the consequences may be.
> If I understand correctly, there might be a chance of losing updates that
> were made on the leader.
> From my side it is a lot worse to lose availability for 3 minutes.
>
> I would really appreciate feedback on this.
>
> On Mon, Aug 10, 2015 at 6:55 PM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
>> Did you look at release notes for Solr versions after your own?
>>
>> I am pretty sure some similar things were identified and/or resolved
>> for 5.x. It may not help if you cannot migrate, but would at least
>> give a confirmation and maybe workaround on what you are facing.
>>
>> Regards,
>> Alex.
>> ----
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 10 August 2015 at 11:37, danny teichthal <dannyt...@gmail.com> wrote:
>> > Hi,
>> > We are using Solr cloud with solr 4.10.4.
>> > In the past week we encountered a problem where all of our servers
>> > disconnected from the zookeeper cluster.
>> > This might be ok; the problem is that after reconnecting to zookeeper it
>> > looks like for every collection both replicas do not have a leader and
>> > are stuck in some kind of a deadlock for a few minutes.
>> >
>> > From what we understand:
>> > One of the replicas assumes it will be the leader and at some point
>> > starts to wait on leaderVoteWait, which is by default 3 minutes.
>> > The other replica is stuck on this part of the code for a few minutes:
>> > at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:957)
>> > at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:921)
>> > at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1521)
>> > at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:392)
>> >
>> > Looks like replica 1 waits for a leader to be registered in zookeeper,
>> > but replica 2 is waiting for replica 1
>> > (org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp).
>> >
>> > We have 100 collections distributed in 3 pairs of Solr nodes. Each
>> > collection has one shard with 2 replicas.
>> > As I understand from the code and logs, all the collections are being
>> > registered synchronously, which means that we have to wait 3 minutes *
>> > number of collections for the whole cluster to come up. It could be more
>> > than an hour!
>> >
>> > 1. We thought about lowering leaderVoteWait to solve the problem, but we
>> > are not sure what the risk is.
>> >
>> > 2. The following thread is very similar to our case:
>> > http://qnalist.com/questions/4812859/waitforleadertoseedownstate-when-leader-is-down
>> > Does anybody know if it is indeed a bug and if there's a related JIRA
>> > issue?
>> >
>> > 3. I see this in the logs before the reconnection: "Client session timed
>> > out, have not heard from server in 48865ms for sessionid
>> > 0x44efbb91b5f0001, closing socket connection and attempting reconnect".
>> > Does it mean that there was a disconnection of over 50 seconds between
>> > Solr and zookeeper?
>> >
>> > Thanks in advance for your kind answer
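
For reference, the worst-case estimate in the original question ("3 minutes * number of collections") can be worked through with a short Java sketch. It is only an illustration of that arithmetic, not Solr code. The class and method names are made up for this example, and it assumes that the 100 collections are spread evenly over the 3 node pairs and that every core registration blocks for the full leaderVoteWait, one after another. Neither assumption is confirmed in the thread.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not part of Solr.
public class LeaderVoteWaitEstimate {

    // Cores hosted on one node, ASSUMING collections are spread evenly
    // across the node pairs (ceiling division).
    public static int coresPerNode(int collections, int nodePairs) {
        return (collections + nodePairs - 1) / nodePairs;
    }

    // Worst case for one node: every core waits the full leaderVoteWait
    // and the cores are registered sequentially.
    public static Duration worstCaseSequential(int coresOnNode, Duration leaderVoteWait) {
        return leaderVoteWait.multipliedBy(coresOnNode);
    }

    public static void main(String[] args) {
        Duration leaderVoteWait = Duration.ofMinutes(3); // default cited in the thread
        int cores = coresPerNode(100, 3);                // 100 collections, 3 node pairs
        Duration worst = worstCaseSequential(cores, leaderVoteWait);
        System.out.println(cores + " cores per node, worst case "
                + worst.toMinutes() + " minutes");
        // prints "34 cores per node, worst case 102 minutes"
    }
}
```

Under those assumptions each node could take up to about 102 minutes to register everything, which matches the "more than an hour" figure above. If registration runs in parallel, as SOLR-7503 is described as doing, the sequential multiplier would no longer apply in the same way.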