Re: Cluster down for long time after zookeeper disconnection

2015-08-11 Thread danny teichthal
1. Erick, thanks. I agree that it is really serious, but I think that the 3 minutes in this case were not mandatory. In my case it was a deadlock, which smells like some kind of bug. One replica is waiting for the other to come up before it takes leadership, while the other is waiting for the election

Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread Erick Erickson
Not that I know of. With ZK as the "one source of truth", dropping below quorum is Really Serious, so having to wait 3 minutes or so for action to be taken is the fallback. Best, Erick On Mon, Aug 10, 2015 at 1:34 PM, danny teichthal wrote: > Erick, I assume you are referring to zkClientTimeout,

Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread danny teichthal
Erick, I assume you are referring to zkClientTimeout; it is set to 30 seconds. I also see these messages on the Solr side: "Client session timed out, have not heard from server in 48865ms for sessionid 0x44efbb91b5f0001, closing socket connection and attempting reconnect". So, I'm not sure what was th

Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread Erick Erickson
I didn't see the ZK timeout you set (just skimmed). But if your ZooKeeper was down _very_ temporarily, it may suffice to up the ZK timeout. The default in the 4.10 time-frame (if I remember correctly) was 15 seconds, which has proven to be too short in many circumstances. Of course if your ZK was

Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread danny teichthal
Hi Alexandre, thanks for your reply. I looked at the release notes. There is one bug fix, SOLR-7503 (register cores asynchronously). It may reduce the registration time since it is done in parallel, but still, 3 minutes (leaderVoteWait) is a long

Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread Alexandre Rafalovitch
Did you look at the release notes for Solr versions after your own? I am pretty sure some similar things were identified and/or resolved for 5.x. It may not help if you cannot migrate, but it would at least give a confirmation and maybe a workaround for what you are facing. Regards, Alex. Solr Anal

Cluster down for long time after zookeeper disconnection

2015-08-10 Thread danny teichthal
Hi, We are using SolrCloud with Solr 4.10.4. In the past week we encountered a problem where all of our servers disconnected from the ZooKeeper cluster. That in itself might be OK; the problem is that after reconnecting to ZooKeeper, it looks like for every collection both replicas do not have a leader and are