Not that I know of. With ZK as the "one source of truth", dropping below quorum is Really Serious, so having to wait 3 minutes or so for action to be taken is the fallback.
Best,
Erick

On Mon, Aug 10, 2015 at 1:34 PM, danny teichthal <dannyt...@gmail.com> wrote:
> Erick, I assume you are referring to zkClientTimeout; it is set to 30
> seconds. I also see these messages on the Solr side:
> "Client session timed out, have not heard from server in 48865ms for
> sessionid 0x44efbb91b5f0001, closing socket connection and attempting
> reconnect".
> So I'm not sure what the actual disconnection duration was, but it
> could have been up to a minute.
> We are working on finding the root cause of the network issues, but
> assuming disconnections will always occur, are there any other options to
> overcome these issues?
>
> On Mon, Aug 10, 2015 at 11:18 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> I didn't see the zk timeout you set (just skimmed). But if your Zookeeper
>> was down _very_ temporarily, it may suffice to up the ZK timeout. The
>> default in the 4.10 time-frame (if I remember correctly) was 15 seconds,
>> which has proven to be too short in many circumstances.
>>
>> Of course if your ZK was down for minutes this wouldn't help.
>>
>> Best,
>> Erick
>>
>> On Mon, Aug 10, 2015 at 1:06 PM, danny teichthal <dannyt...@gmail.com>
>> wrote:
>> > Hi Alexander,
>> > Thanks for your reply, I looked at the release notes.
>> > There is one bug fix - SOLR-7503
>> > <https://issues.apache.org/jira/browse/SOLR-7503> – register cores
>> > asynchronously.
>> > It may reduce the registration time since it is done in parallel, but
>> > still, 3 minutes (leaderVoteWait) is a long time to recover from a few
>> > seconds of disconnection.
>> >
>> > Except for that one, I don't see any bug fix that addresses the same
>> > problem.
>> > I am able to reproduce it on 4.10.4 pretty easily; I will also try it
>> > with 5.2.1 and see if it reproduces.
>> >
>> > Anyway, since migrating to 5.2.1 is not an option for me in the short
>> > term, I'm left with the question of whether reducing leaderVoteWait may
>> > help here, and what the consequences may be.
>> > If I understand correctly, there might be a chance of losing updates
>> > that were made on the leader.
>> > From my side it is a lot worse to lose availability for 3 minutes.
>> >
>> > I would really appreciate feedback on this.
>> >
>> > On Mon, Aug 10, 2015 at 6:55 PM, Alexandre Rafalovitch
>> > <arafa...@gmail.com> wrote:
>> >
>> >> Did you look at the release notes for Solr versions after your own?
>> >>
>> >> I am pretty sure some similar things were identified and/or resolved
>> >> for 5.x. It may not help if you cannot migrate, but it would at least
>> >> give a confirmation and maybe a workaround for what you are facing.
>> >>
>> >> Regards,
>> >>    Alex.
>> >> ----
>> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> >> http://www.solr-start.com/
>> >>
>> >> On 10 August 2015 at 11:37, danny teichthal <dannyt...@gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> > We are using Solr Cloud with Solr 4.10.4.
>> >> > Over the past week we encountered a problem where all of our servers
>> >> > disconnected from the zookeeper cluster.
>> >> > This might be OK; the problem is that after reconnecting to
>> >> > zookeeper, it looks like for every collection both replicas do not
>> >> > have a leader and are stuck in some kind of a deadlock for a few
>> >> > minutes.
>> >> >
>> >> > From what we understand:
>> >> > One of the replicas assumes it will be the leader and at some point
>> >> > starts waiting on leaderVoteWait, which is by default 3 minutes.
>> >> > The other replica is stuck on this part of the code for a few
>> >> > minutes:
>> >> >   at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:957)
>> >> >   at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:921)
>> >> >   at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1521)
>> >> >   at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:392)
>> >> >
>> >> > Looks like replica 1 waits for a leader to be registered in
>> >> > zookeeper, but replica 2 is waiting for replica 1
>> >> > (org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp).
>> >> >
>> >> > We have 100 collections distributed across 3 pairs of Solr nodes.
>> >> > Each collection has one shard with 2 replicas.
>> >> > As I understand from the code and logs, all the collections are
>> >> > being registered synchronously, which means that we have to wait
>> >> > 3 minutes * number of collections for the whole cluster to come up.
>> >> > It could be more than an hour!
>> >> >
>> >> > 1. We thought about lowering leaderVoteWait to solve the problem,
>> >> > but we are not sure what the risk is.
>> >> >
>> >> > 2. The following thread is very similar to our case:
>> >> > http://qnalist.com/questions/4812859/waitforleadertoseedownstate-when-leader-is-down
>> >> > Does anybody know if it is indeed a bug, and if there's a related
>> >> > JIRA issue?
>> >> >
>> >> > 3. I see this in the logs before the reconnection: "Client session
>> >> > timed out, have not heard from server in 48865ms for sessionid
>> >> > 0x44efbb91b5f0001, closing socket connection and attempting
>> >> > reconnect". Does this mean that there was a disconnection of over
>> >> > 50 seconds between Solr and zookeeper?
>> >> >
>> >> > Thanks in advance for your kind answer
>> >>
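[Editor's note for readers of this archived thread: both settings discussed above (zkClientTimeout and leaderVoteWait) are configured in the `<solrcloud>` section of solr.xml in Solr of this era. A minimal, hedged sketch follows; the values shown are illustrative only, and element names should be verified against the solr.xml reference for your exact Solr version.]

```xml
<!-- solr.xml fragment (new-style solr.xml, Solr 4.10.x era).
     Values below are illustrative, not recommendations. -->
<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <!-- ZK session timeout: raising it lets Solr ride out brief
         network blips without the session expiring. Often set via
         the zkClientTimeout system property. -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <!-- How long a candidate leader waits for the other replicas of
         a shard to come up before taking leadership anyway.
         Default is 180000 ms (the 3 minutes discussed above);
         lowering it trades safety (possible lost updates if the
         real leader is merely slow) for faster availability. -->
    <int name="leaderVoteWait">180000</int>
  </solrcloud>
</solr>
```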