Thanks - I can try to look into this next week. You might copy the details 
into a JIRA issue so it doesn't get lost, though...

- Mark

On Jun 3, 2013, at 4:46 PM, John Guerrero <jguerr...@tagged.com> wrote:

> SOLR 4.2.1, tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP),
> java 6u27 64 bit
> 6 nodes, 2 shards, 3 replicas each.  Names changed to r1s2 (replica1 - shard
> 2), r2s2, and r3s2 for each replica in shard 2.
> 
> What we see:
> 
> * Under production load, we restart a leader (r1s2), and observe in the
> cloud admin
> that the old leader is in state "Down" and no new leader is ever elected.
> * The system will stay like this until we stop the old leader (or cause a ZK
> timeout...see below).
> 
> *Please note:* the leader is killed, then kill -9'd 5 seconds later, before
> restarting.  We have since changed this.
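For context, the stop sequence at the time looked roughly like the sketch below. This is a reconstruction, not the actual script: only the 5-second delay comes from the description above; the function and variable names are illustrative. The problem is the unconditional kill -9 after a fixed wait, which leaves the index unclean whenever shutdown takes longer than that:

```shell
# Hypothetical reconstruction of the original stop logic. Only the
# 5-second delay is taken from the description; names are made up.
hard_stop() {
  pid="$1"
  kill "$pid" 2>/dev/null     # polite SIGTERM first
  sleep 5                     # fixed 5-second grace period
  kill -9 "$pid" 2>/dev/null  # unconditional hard kill, even mid-shutdown
}
```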
> 
> Digging into the logs on the old leader (r1s2 = replica1-shard 2):
> 
> * The old leader restarted at 5:23:29 PM, but appears to be stuck in
> SolrDispatchFilter.init() -- (See recovery at bottom).
> * It doesn't want to become leader, possibly due to the unclean shutdown.
> May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions
> INFO: PeerSync: core=browse url=http://r1s2:8080/solr  Our versions are too
> old. ourHighThreshold=1436325665147191297
> otherLowThreshold=1436325775374548992
> * It then tries to recover, but cannot, because there is no leader.
> May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover.
> core=browse:org.apache.solr.common.SolrException: No registered leader was
> found, collection:browse slice:shard2
> * Meanwhile, it appears that blocking in init() prevents the http-8080
> handler from starting (see recovery at bottom).
> 
> Digging into the other replicas (r2s2):
> 
> * For some reason, the old leader (r1s2) remains in the list of replicas
> that r2s2 attempts to sync to.
> May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync
> INFO: PeerSync: core=browse url=http://r2s2:8080/solr START
> replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/]
> nUpdates=100
> * This apparently fails (30-second timeout), possibly because the http-8080
> handler has not started on r1s2.
> May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse
> WARNING: PeerSync: core=browse url=http://r2s2:8080/solr  exception talking
> to http://r1s2:8080/solr/browse/, failed
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> waiting response from server at: http://r1s2:8080/solr/browse
> 
> *At this point, the cluster will remain indefinitely without a leader, if
> nothing else changes.*
> 
> But in this particular instance, we took some stack and heap dumps from
> r1s2, which paused the JVM
> long enough to cause a *zookeeper timeout on the old leader (r1s2)*:
> May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run
> INFO: Client session timed out, have not heard from server in 38226ms for
> sessionid 0x23d28e0f584005d, closing socket connection and attempting
> reconnect
> 
> Then, one of the replicas (r3s2) finally stopped trying to sync to r1s2 and
> succeeded in becoming leader:
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr START
> replicas=[http://r2s2:8080/solr/browse/] nUpdates=100
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Received 100 versions
> from r2s2:8080/solr/browse/
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Our versions are
> newer. ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync succeeded
> 
> Now that we have a leader, r1s2 can succeed in recovery and finish
> SolrDispatchFilter.init(),
> which apparently allows the http-8080 handler on r1s2 to start.
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay
> INFO: No replay needed. core=browse
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> INFO: Replication Recovery was successful - registering as Active.
> core=browse
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> INFO: publishing core=browse state=active
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> INFO: numShards not found on descriptor - reading it from system property
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> INFO: Finished recovery process. core=browse
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run
> INFO: Starting recovery process.  core=browse recoveringAfterStartup=false
> May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader
> updateClusterState
> INFO: Updating cloud state from ZooKeeper...
> May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
> INFO: user.dir=/
> May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
> *INFO: SolrDispatchFilter.init() done*
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> INFO: publishing core=browse state=recovering
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> INFO: numShards not found on descriptor - reading it from system property
> May 28, 2013 5:34:49 PM org.apache.solr.client.solrj.impl.HttpClientUtil
> createClient
> INFO: Creating new http client,
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> May 28, 2013 5:34:49 PM org.apache.catalina.startup.HostConfig
> deployDirectory
> INFO: Deploying web application directory ROOT
> May 28, 2013 5:34:49 PM org.apache.coyote.http11.Http11AprProtocol start
> *INFO: Starting Coyote HTTP/1.1 on http-8080  #<----------------------
> http-8080 started!*
> May 28, 2013 5:34:49 PM org.apache.coyote.ajp.AjpAprProtocol start
> INFO: Starting Coyote AJP/1.3 on ajp-8009
> May 28, 2013 5:34:49 PM org.apache.catalina.startup.Catalina start
> *INFO: Server startup in 668324 ms #<------------------------------ 668 sec
> = 11 minutes to start Catalina.*
> 
> Our Workaround:
> 
> * We changed our script to allow 15 seconds before kill -9.
> * Also, we no longer do a restart.  We just stop the leader and wait for a
> new leader.  This
> still results in a few "No registered leader was found" exceptions, but at
> least the duration is short.
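The revised stop logic might be sketched as follows. This is an illustration rather than the actual script: the 15-second limit is the only detail taken from above, and polling instead of a fixed sleep is an assumption.

```shell
# Sketch of the revised stop logic: SIGTERM, then poll for up to
# $2 seconds (defaulting to 15) before resorting to kill -9.
# Names and the polling approach are assumptions, not the real script.
stop_gracefully() {
  pid="$1"
  timeout="${2:-15}"
  kill "$pid" 2>/dev/null || return 0        # already gone
  i=0
  while [ "$i" -lt "$timeout" ]; do
    kill -0 "$pid" 2>/dev/null || return 0   # exited cleanly
    sleep 1
    i=$((i + 1))
  done
  kill -9 "$pid" 2>/dev/null                 # last resort after grace period
}
```

Polling rather than sleeping for a fixed interval means a clean shutdown returns within a second or two, while a hung process still gets killed after the full grace period.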
> 