Hi Mark,

Thanks for responding.
Looking under /collections/solr_signal/leader_elect/shard5/election/ there are 2 nodes:

161276082334072879-ADDR1:8983_solr_solr_signal-n_0000000053 - Mon Mar 18 17:36:41 UTC 2013
161276082334072880-ADDR2:8983_solr_solr_signal-n_0000000056 - Mon Mar 18 17:48:22 UTC 2013

So it looks like the election node for ADDR2 (the replica that cannot recover) is newer than the one for ADDR1 (the node that is still online and serving requests). Could I just delete that newer node from ZK? (A rough sketch of what I'm thinking is below the quoted thread.)

Cheers,
Tim

On Mon, Mar 18, 2013 at 12:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
> Hmm…
>
> Sounds like it's a defensive mechanism we have where a leader will check
> its own state about whether it thinks it's the leader against the zk info. In
> this case its own state is not convinced of its leadership. That's just a
> volatile boolean that gets flipped on when elected.
>
> What do the election nodes in ZooKeeper say? Who do they think the leader
> is?
>
> Something is off, but I'm kind of surprised restarting the leader doesn't
> fix it. Someone else should register as the leader or the restarted node
> should reclaim its spot.
>
> I have no idea if this is solved in 4.2 or not since I don't really know
> what's happened, but I'd love to get to the bottom of it.
>
> After setting the leader volatile boolean to true, the only way it goes
> false other than restart is session expiration. In that case we do flip it to
> false - but session expiration should also cause the leader node to drop…
>
>
> - Mark
>
> On Mar 18, 2013, at 1:57 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
> > Having an issue running on a nightly build of Solr 4.1 (tag -
> > 4.1.0.2013.01.10.20.44.27).
> >
> > I had a replica fail, and when trying to bring it back online, recovery
> > fails because the leader responds with "We are not the leader" (see trace
> > below).
> >
> > SEVERE: org.apache.solr.common.SolrException: We are not the leader
> >     at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:907)
> >     at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >     at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:365)
> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
> >     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> >     ...
> >
> > The worrisome part is that clusterstate.json seems to show this node (ADDR1)
> > is the leader (I obfuscated addresses as ADDR1 and ADDR2):
> >
> >   "shard5":{
> >     "range":"b8e30000-c71bffff",
> >     "replicas":{
> >       "ADDR1:8983_solr_solr_signal":{
> >         "shard":"shard5",
> >         "roles":null,
> >         "state":"active",
> >         "core":"solr_signal",
> >         "collection":"solr_signal",
> >         "node_name":"ADDR1:8983_solr",
> >         "base_url":"http://ADDR1:8983/solr",
> >         "leader":"true"},
> >       "ADDR2:8983_solr_solr_signal":{
> >         "shard":"shard5",
> >         "roles":null,
> >         "state":"recovering",
> >         "core":"solr_signal",
> >         "collection":"solr_signal",
> >         "node_name":"ADDR2:8983_solr",
> >         "base_url":"http://ADDR2:8983/solr"}}},
> >
> > I assume the obvious answer is to upgrade to 4.2. I'm willing to go down
> > that path, but I wanted to see if there was something quick I could do to get
> > the leader to start thinking it is the leader again. Restarting it doesn't
> > seem to do the trick.
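For concreteness, here's roughly what I had in mind for inspecting and then removing that stale election entry. It's an untested sketch against the plain ZooKeeper Java client; the zk1:2181 connect string is just a placeholder for our ensemble, and the hard-coded znode name is the ADDR2 entry from the listing above. I'd only run the delete if you confirm it's actually safe:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class DropStaleElectionNode {

        // Path and znode name copied from the listing above; the connect
        // string is a placeholder for our ensemble.
        private static final String ELECTION_PATH =
            "/collections/solr_signal/leader_elect/shard5/election";
        private static final String STALE_NODE =
            "161276082334072880-ADDR2:8983_solr_solr_signal-n_0000000056";

        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();
            try {
                // List the election znodes; the -n_NNNNNNNNNN suffix ordering
                // determines who is first in line for leadership.
                List<String> children = zk.getChildren(ELECTION_PATH, false);
                Collections.sort(children);
                for (String child : children) {
                    System.out.println(child);
                }

                // Delete the newer entry (ADDR2, the replica stuck in recovery).
                // Version -1 means "any version".
                zk.delete(ELECTION_PATH + "/" + STALE_NODE, -1);
                System.out.println("Deleted " + ELECTION_PATH + "/" + STALE_NODE);
            } catch (KeeperException e) {
                e.printStackTrace();
            } finally {
                zk.close();
            }
        }
    }

Or, if there's a safer way to get ADDR1 to start believing it's the leader again (or to force a fresh election for shard5), I'm happy to go that route instead.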