Thanks Mark, sure thing.  I'll do that tonight.

jlg

On 6/3/13 9:03 PM, Mark Miller wrote:
> Thanks - I can try and look into this perhaps next week. You might 
> copy the details into a JIRA issue to prevent it from getting lost 
> though...
>
> - Mark
>
> On Jun 3, 2013, at 4:46 PM, John Guerrero <[hidden email]> wrote:
>
> > Solr 4.2.1, Tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64
> > #1 SMP), Java 6u27 (64-bit).
> > 6 nodes, 2 shards, 3 replicas each.  Names changed to r1s2
> > (replica 1, shard 2), r2s2, and r3s2 for each replica in shard 2.
> >
> > What we see:
> >
> > * Under production load, we restart a leader (r1s2), and observe in
> > the Cloud admin UI that the old leader is in state "Down" and no new
> > leader is ever elected.
> > * The system will stay like this until we stop the old leader (or
> > cause a ZK timeout...see below).
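> >
> > The quickest way we know to confirm that no leader is registered is
> > to read the leader znode straight from ZooKeeper.  A minimal sketch,
> > assuming the Solr 4.x znode layout; kazoo and the ensemble address
> > are our own choices:
> >
> > from kazoo.client import KazooClient  # pip install kazoo
> >
> > zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
> > zk.start()
> > path = "/collections/browse/leaders/shard2"  # ephemeral leader registration
> > if zk.exists(path):
> >     data, _ = zk.get(path)
> >     print("registered leader:", data.decode())  # JSON props of the leader core
> > else:
> >     print("no leader registered for shard2")    # the state we were stuck in
> > zk.stop()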
> >
> > *Please note:* the leader is killed, then kill -9'd 5 seconds
> > later, before restarting.  We have since changed this.
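> >
> > In script form, that stop step amounted to the following (a sketch;
> > the pid handling is ours, and the grace period was 5 seconds at the
> > time):
> >
> > import os, signal, time
> >
> > def stop_tomcat(pid, grace=5):
> >     os.kill(pid, signal.SIGTERM)        # ask Tomcat to shut down cleanly
> >     deadline = time.time() + grace
> >     while time.time() < deadline:
> >         try:
> >             os.kill(pid, 0)             # signal 0 = existence check only
> >         except OSError:
> >             return True                 # exited within the grace period
> >         time.sleep(0.5)
> >     os.kill(pid, signal.SIGKILL)        # 5s proved too short for a clean stop
> >     return False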
> >
> > Digging into the logs on the old leader (r1s2 = replica1-shard 2):
> >
> > * The old leader restarted at 5:23:29 PM, but appears to be stuck
> > in SolrDispatchFilter.init() (see recovery at bottom).
> > * It doesn't want to become leader, possibly due to the unclean
> > shutdown.
> > May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions
> > INFO: PeerSync: core=browse url=http://r1s2:8080/solr  Our versions
> > are too old. ourHighThreshold=1436325665147191297
> > otherLowThreshold=1436325775374548992
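> >
> > Reading those two thresholds side by side shows why PeerSync gives
> > up (the field meanings here are our reading of the log):
> >
> > ourHighThreshold  = 1436325665147191297  # newest update r1s2 has seen
> > otherLowThreshold = 1436325775374548992  # oldest update in the peers' recent window
> > assert ourHighThreshold < otherLowThreshold
> > # r1s2's newest update predates the peers' recent-updates window
> > # entirely, so PeerSync cannot bridge the gap and refuses to sync.
> >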
> > * It then tries to recover, but cannot, because there is no leader.
> > May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log
> > SEVERE: Error while trying to recover.
> > core=browse:org.apache.solr.common.SolrException: No registered
> > leader was found, collection:browse slice:shard2
> > * Meanwhile, it appears that blocking in init() prevents the
> > http-8080 connector from starting, since Tomcat only starts its
> > connectors after the webapps finish initializing (see recovery at
> > bottom).
> >
> > Digging into the other replicas (r2s2):
> >
> > * For some reason, the old leader (r1s2) remains in the list of
> > replicas that r2s2 attempts to sync to.
> > May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync
> > INFO: PeerSync: core=browse url=http://r2s2:8080/solr START
> > replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/]
> > nUpdates=100
> > * This apparently fails (30-second timeout), possibly because the
> > http-8080 connector was never started on r1s2.
> > May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse
> > WARNING: PeerSync: core=browse url=http://r2s2:8080/solr exception
> > talking to http://r1s2:8080/solr/browse/, failed
> > org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> > waiting response from server at: http://r1s2:8080/solr/browse
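> >
> > A plain TCP probe would have shown the connector was down (a
> > hypothetical check; host and port are taken from the logs above):
> >
> > import socket
> >
> > def port_open(host, port, timeout=5):
> >     try:
> >         with socket.create_connection((host, port), timeout=timeout):
> >             return True
> >     except OSError:
> >         return False
> >
> > print(port_open("r1s2", 8080))  # False while init() is still blocked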
> >
> > *At this point, the cluster will remain without a leader
> > indefinitely if nothing else changes.*
> >
> > But in this particular instance, we took some stack and heap dumps
> > from r1s2, which paused the JVM long enough to cause a *ZooKeeper
> > timeout on the old leader (r1s2)*:
> > May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run
> > INFO: Client session timed out, have not heard from server in 38226ms
> > for sessionid 0x23d28e0f584005d, closing socket connection and
> > attempting reconnect
> >
> > Then, one of the replicas (r3s2) finally stopped trying to sync to
> > r1s2 and succeeded in becoming leader:
> > May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
> > INFO: PeerSync: core=browse url=http://r3s2:8080/solr START
> > replicas=[http://r2s2:8080/solr/browse/] nUpdates=100
> > May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
> > INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Received 100
> > versions from r2s2:8080/solr/browse/
> > May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
> > INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Our versions
> > are newer. ourLowThreshold=1436325775374548992
> > otherHigh=1436325775805513730
> > May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
> > INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync
> > succeeded
> >
> > Now that we have a leader, r1s2 can succeed in recovery and finish
> > SolrDispatchFilter.init(), apparently allowing the http-8080
> > connector to start (r1s2):
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay
> > INFO: No replay needed. core=browse
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> > INFO: Replication Recovery was successful - registering as Active.
> > core=browse
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> > INFO: publishing core=browse state=active
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> > INFO: numShards not found on descriptor - reading it from system property
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> > INFO: Finished recovery process. core=browse
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run
> > INFO: Starting recovery process.  core=browse recoveringAfterStartup=false
> > May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader
> > updateClusterState
> > INFO: Updating cloud state from ZooKeeper...
> > May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
> > INFO: user.dir=/
> > May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
> > *INFO: SolrDispatchFilter.init() done*
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> > INFO: publishing core=browse state=recovering
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> > INFO: numShards not found on descriptor - reading it from system property
> > May 28, 2013 5:34:49 PM org.apache.solr.client.solrj.impl.HttpClientUtil
> > createClient
> > INFO: Creating new http client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > May 28, 2013 5:34:49 PM org.apache.catalina.startup.HostConfig
> > deployDirectory
> > INFO: Deploying web application directory ROOT
> > May 28, 2013 5:34:49 PM org.apache.coyote.http11.Http11AprProtocol start
> > *INFO: Starting Coyote HTTP/1.1 on http-8080  #<-- http-8080 started!*
> > May 28, 2013 5:34:49 PM org.apache.coyote.ajp.AjpAprProtocol start
> > INFO: Starting Coyote AJP/1.3 on ajp-8009
> > May 28, 2013 5:34:49 PM org.apache.catalina.startup.Catalina start
> > *INFO: Server startup in 668324 ms  #<-- 668 sec = ~11 minutes to
> > start Catalina.*
> >
> > Our Workaround:
> >
> > * We changed our script to allow 15 seconds before kill -9.
> > * Also, we no longer do a restart.  We just stop the leader and wait
> > for a new leader, as sketched below.  This still results in a few
> > "No registered leader was found" exceptions, but at least the
> > duration is short.
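> >
> > The stop-and-wait step is essentially this loop (a sketch; the znode
> > path assumes the Solr 4.x leader registration shown earlier, and the
> > 60-second budget is our own choice):
> >
> > import time
> > from kazoo.client import KazooClient  # pip install kazoo
> >
> > def wait_for_new_leader(zk, collection, shard, budget=60):
> >     path = "/collections/%s/leaders/%s" % (collection, shard)
> >     deadline = time.time() + budget
> >     while time.time() < deadline:
> >         if zk.exists(path):              # ephemeral node reappears once
> >             return zk.get(path)[0]       # a replica wins the election
> >         time.sleep(1)
> >     raise RuntimeError("no leader for %s %s" % (collection, shard))
> >
> > zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
> > zk.start()
> > print(wait_for_new_leader(zk, "browse", "shard2"))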
> >