Hi Mark, I figured out what got the cluster into this bad state. I did a rolling restart, and one of the JVM processes wasn't killed off before I restarted it, i.e. there ended up being two Solr JVM processes running for the same shard. (Perhaps some Solr initialization happens before Jetty fails to bind to the port that's already in use??) Bottom line: make sure the old process is really dead before you restart it. Thanks for the help.
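In case it helps anyone else, the kind of pre-restart check I mean is roughly the following - a minimal sketch in Python, assuming the stock 8983 port; any equivalent check that the old JVM is really gone before starting the new one would do:

    # Rough pre-restart check: refuse to start a new Solr JVM while something
    # is still listening on the Solr port (8983 here is an assumption).
    import socket
    import sys

    SOLR_PORT = 8983  # adjust to the port your node uses

    def port_in_use(port, host="localhost"):
        try:
            with socket.create_connection((host, port), timeout=2):
                return True   # something answered, so the old JVM is likely still up
        except OSError:
            return False      # connection refused / timed out, so the port is free

    if port_in_use(SOLR_PORT):
        sys.exit("Port %d is still in use - kill the old Solr JVM before restarting" % SOLR_PORT)
    print("Port %d is free, safe to start Solr" % SOLR_PORT)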
Tim

On Mon, Mar 18, 2013 at 12:27 PM, Timothy Potter <thelabd...@gmail.com> wrote:

> Hi Mark,
>
> Thanks for responding.
>
> Looking under /collections/solr_signal/leader_elect/shard5/election/ there
> are 2 nodes:
>
> 161276082334072879-ADDR1:8983_solr_solr_signal-n_0000000053 - Mon Mar 18 17:36:41 UTC 2013
> 161276082334072880-ADDR2:8983_solr_solr_signal-n_0000000056 - Mon Mar 18 17:48:22 UTC 2013
>
> So it looks like the election node for ADDR2 (the node that cannot recover) is
> later than the one for ADDR1 (the node that is still online and serving requests).
>
> Could I just delete that newer node from ZK?
>
> Cheers,
> Tim
>
> On Mon, Mar 18, 2013 at 12:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Hmm…
>>
>> Sounds like it's a defensive mechanism we have where a leader will check
>> its own state about whether it thinks it's the leader against the zk info. In
>> this case its own state is not convinced of its leadership. That's just a
>> volatile boolean that gets flipped on when elected.
>>
>> What do the election nodes in ZooKeeper say? Who do they think the leader is?
>>
>> Something is off, but I'm kind of surprised restarting the leader doesn't
>> fix it. Someone else should register as the leader, or the restarted node
>> should reclaim its spot.
>>
>> I have no idea if this is solved in 4.2 or not since I don't really know
>> what's happened, but I'd love to get to the bottom of it.
>>
>> After setting the leader volatile boolean to true, the only way it goes
>> false other than restart is session expiration. In that case we do flip it to
>> false - but session expiration should also cause the leader node to drop…
>>
>> - Mark
>>
>> On Mar 18, 2013, at 1:57 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>
>> > Having an issue running on a nightly build of Solr 4.1 (tag -
>> > 4.1.0.2013.01.10.20.44.27).
>> >
>> > I had a replica fail, and when trying to bring it back online, recovery
>> > fails because the leader responds with "We are not the leader" (see trace below).
>> >
>> > SEVERE: org.apache.solr.common.SolrException: We are not the leader
>> >         at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:907)
>> >         at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
>> >         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >         at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:365)
>> >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
>> >         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>> >         ...
>> > The worrisome part is that clusterstate.json seems to show this node (ADDR1)
>> > is the leader (I obfuscated addresses using ADDR1 and ADDR2):
>> >
>> > "shard5":{
>> >   "range":"b8e30000-c71bffff",
>> >   "replicas":{
>> >     "ADDR1:8983_solr_solr_signal":{
>> >       "shard":"shard5",
>> >       "roles":null,
>> >       "state":"active",
>> >       "core":"solr_signal",
>> >       "collection":"solr_signal",
>> >       "node_name":"ADDR1:8983_solr",
>> >       "base_url":"http://ADDR1:8983/solr",
>> >       "leader":"true"},
>> >     "ADDR2:8983_solr_solr_signal":{
>> >       "shard":"shard5",
>> >       "roles":null,
>> >       "state":"recovering",
>> >       "core":"solr_signal",
>> >       "collection":"solr_signal",
>> >       "node_name":"ADDR2:8983_solr",
>> >       "base_url":"http://ADDR2:8983/solr"}}},
>> >
>> > I assume the obvious answer is to upgrade to 4.2. I'm willing to go down
>> > that path, but wanted to see if there was something quick I could do to get
>> > the leader to start thinking it is the leader again. Restarting it doesn't
>> > seem to do the trick.
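For reference, the election/leader state Mark asked about can be read straight out of ZooKeeper. A minimal sketch using the kazoo Python client - the ZK host string and the leaders/ path are assumptions based on a stock 4.x layout; the election path is the one quoted above:

    # Sketch: see who ZooKeeper thinks the leader is for one shard.
    # The host string and collection/shard names below are assumptions -
    # substitute your own ensemble and collection.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zkhost1:2181,zkhost2:2181,zkhost3:2181")
    zk.start()

    collection, shard = "solr_signal", "shard5"

    # The leader znode (if present) holds the core/base_url of the current leader.
    leader_path = "/collections/%s/leaders/%s" % (collection, shard)
    if zk.exists(leader_path):
        data, _ = zk.get(leader_path)
        print("leader znode:", data.decode("utf-8"))

    # The election znodes are the queue of candidates; the lowest sequence
    # number is first in line to become leader.
    election_path = "/collections/%s/leader_elect/%s/election" % (collection, shard)
    for child in sorted(zk.get_children(election_path)):
        print("election node:", child)

    zk.stop()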