When the leader reaches 99% of physical memory on the box and starts swapping (and stops replicating), we forcefully bring it down (first kill -15, then kill -9 if kill -15 doesn't work). At that point we expect the replica to assume the leader's role, but it never does.
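A minimal sketch of that stop sequence (kill -15 first, kill -9 only if the process is still there), written in Python for illustration. The names `stop_process` and `is_alive`, the 30-second grace period and the polling interval are assumptions made for this example, not part of Solr or of the setup described here. It assumes a Unix box and a process owned by the calling user.

```python
import os
import signal
import time


def is_alive(pid: int) -> bool:
    """Return True if a process with this pid exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def stop_process(pid: int, grace_seconds: float = 30.0,
                 poll_interval: float = 0.5) -> str:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns "not running", "terminated" or "killed".
    Note: an unreaped child of the caller (a zombie) still counts as alive,
    so this is meant for processes the script did not spawn itself.
    """
    try:
        os.kill(pid, signal.SIGTERM)          # equivalent of kill -15
    except ProcessLookupError:
        return "not running"

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return "terminated"
        time.sleep(poll_interval)

    try:
        os.kill(pid, signal.SIGKILL)          # equivalent of kill -9
    except ProcessLookupError:
        return "terminated"                   # exited just before the SIGKILL
    return "killed"
```

Whether the node was "terminated" or "killed" may be worth logging: a node that needed kill -9 had no chance to shut down cleanly, so its ZooKeeper session is only dropped once the session timeout expires.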
The ZooKeeper timeout is 45 seconds. We can increase it up to 2 minutes and test.

<cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}" zkClientTimeout="${zkClientTimeout:45000}">

As per the definition of zkClientTimeout, after the leader is brought down and doesn't talk to ZooKeeper for 45 seconds, shouldn't ZK promote the replica to leader? I am not sure how increasing the ZK timeout will help.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, January 28, 2015 11:42 AM
To: solr-user@lucene.apache.org
Subject: Re: replica never takes leader role

This is not the desired behavior at all. I know there have been improvements in this area since 4.8, but can't seem to locate the JIRAs.

I'm curious _why_ the nodes are going down though, is it happening at random or are you taking it down?

One problem has been that the Zookeeper timeout used to default to 15 seconds, and occasionally a node would be unresponsive (sometimes due to GC pauses) and exceed the timeout. So upping the ZK timeout has helped some people avoid this...

FWIW,
Erick

On Wed, Jan 28, 2015 at 7:11 AM, Joshi, Shital <shital.jo...@gs.com> wrote:

> We're using Solr 4.8.0
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, January 27, 2015 7:47 PM
> To: solr-user@lucene.apache.org
> Subject: Re: replica never takes leader role
>
> What version of Solr? This is an ongoing area of improvements and several
> are very recent.
>
> Try searching the JIRA for Solr for details.
>
> Best,
> Erick
>
> On Tue, Jan 27, 2015 at 1:51 PM, Joshi, Shital <shital.jo...@gs.com>
> wrote:
>
> > Hello,
> >
> > We have SolrCloud cluster (5 shards and 2 replicas) on 10 boxes and three
> > zookeeper instances. We have noticed that when a leader node goes down the
> > replica never takes over as a leader, cloud becomes unusable and we have to
> > bounce entire cloud for replica to assume leader role. Is this default
> > behavior? How can we change this?
> >
> > Thanks.
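For anyone chasing the same symptom, one way to confirm that a shard really has no leader is to inspect the cluster state. The sketch below is an assumption-laden illustration, not an official tool: it takes an already-parsed JSON dictionary shaped like the output of the Collections API CLUSTERSTATUS action (`cluster` > `collections` > `shards` > `replicas`, with `leader` and `state` fields on each replica) and lists the shards that have no active leader. The function name `shards_without_leader` is hypothetical, and the field names should be checked against the Solr version in use.

```python
from typing import Dict, List, Tuple


def shards_without_leader(cluster_status: Dict) -> List[Tuple[str, str]]:
    """Return (collection, shard) pairs that have no active leader replica.

    Assumes a CLUSTERSTATUS-like layout in which each replica carries
    "state" and, for the leader only, "leader": "true".
    """
    missing = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in sorted(collections.items()):
        for shard_name, shard in sorted(coll.get("shards", {}).items()):
            replicas = shard.get("replicas", {}).values()
            has_leader = any(
                r.get("leader") == "true" and r.get("state") == "active"
                for r in replicas
            )
            if not has_leader:
                missing.append((coll_name, shard_name))
    return missing


if __name__ == "__main__":
    # Hand-written sample data for illustration; not captured from a real cluster.
    sample = {
        "cluster": {
            "collections": {
                "collection1": {
                    "shards": {
                        "shard1": {"replicas": {
                            "core_node1": {"state": "active", "leader": "true"},
                            "core_node2": {"state": "active"},
                        }},
                        "shard2": {"replicas": {
                            "core_node3": {"state": "down", "leader": "true"},
                            "core_node4": {"state": "active"},
                        }},
                    }
                }
            }
        }
    }
    print(shards_without_leader(sample))
```

In the sample, shard2 is reported because its only replica marked as leader is down and the surviving replica has not taken over, which matches the situation described in this thread.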