Thanks Mark, sure thing. I'll do that tonight. jlg
On 6/3/13 9:03 PM, Mark Miller-3 [via Lucene] wrote:
> Thanks - I can try and look into this perhaps next week. You might
> copy the details into a JIRA issue to prevent it from getting lost though...
>
> - Mark
>
> On Jun 3, 2013, at 4:46 PM, John Guerrero <[hidden email]> wrote:
>
> > SOLR 4.2.1, Tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP),
> > Java 6u27 64-bit.
> > 6 nodes, 2 shards, 3 replicas each. Names changed to r1s2 (replica 1 -
> > shard 2), r2s2, and r3s2 for each replica in shard 2.
> >
> > What we see:
> >
> > * Under production load, we restart a leader (r1s2), and observe in the
> > cloud admin that the old leader is in state "Down" and no new leader is
> > ever elected.
> > * The system will stay like this until we stop the old leader (or cause
> > a ZK timeout...see below).
> >
> > *Please note:* the leader is killed, then kill -9'd 5 seconds later,
> > before restarting. We have since changed this (see workaround at bottom).
> >
> > Digging into the logs on the old leader (r1s2 = replica 1, shard 2):
> >
> > * The old leader restarted at 5:23:29 PM, but appears to be stuck in
> > SolrDispatchFilter.init() (see recovery at bottom).
> > * It doesn't want to become leader, possibly due to the unclean shutdown:
> >
> > May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions
> > INFO: PeerSync: core=browse url=http://r1s2:8080/solr Our versions are too
> > old. ourHighThreshold=1436325665147191297 otherLowThreshold=1436325775374548992
> >
> > * It then tries to recover, but cannot, because there is no leader:
> >
> > May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log
> > SEVERE: Error while trying to recover.
> > core=browse:org.apache.solr.common.SolrException: No registered leader was
> > found, collection:browse slice:shard2
> >
> > * Meanwhile, it appears that blocking in init() prevents the http-8080
> > handler from starting (see recovery at bottom).
> >
> > Digging into the other replicas (r2s2):
> >
> > * For some reason, the old leader (r1s2) remains in the list of replicas
> > that r2s2 attempts to sync to:
> >
> > May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync
> > INFO: PeerSync: core=browse url=http://r2s2:8080/solr START
> > replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/]
> > nUpdates=100
> >
> > * This apparently fails (30-second timeout), possibly because the
> > http-8080 handler was never started on r1s2:
> >
> > May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse
> > WARNING: PeerSync: core=browse url=http://r2s2:8080/solr exception talking
> > to http://r1s2:8080/solr/browse/, failed
> > org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> > waiting response from server at: http://r1s2:8080/solr/browse
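> >
> > To confirm there really is no registered leader for the slice, you can
> > look at the leader znodes in ZooKeeper directly. A rough example with the
> > stock ZooKeeper CLI -- "zk1" is a placeholder hostname, and the znode
> > layout is what we see in 4.x:
> >
> > # Fails / returns nothing while no leader is registered for shard2:
> > zkCli.sh -server zk1:2181 get /collections/browse/leaders/shard2
> > # Lists the nodes queued up in the leader election:
> > zkCli.sh -server zk1:2181 ls /collections/browse/leader_elect/shard2/election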
> >
> > *At this point, the cluster will remain indefinitely without a leader,
> > if nothing else changes.*
> >
> > But in this particular instance, we took some stack and heap dumps from
> > r1s2, which paused java long enough to cause a *zookeeper timeout on the
> > old leader (r1s2)*:
> >
> > May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run
> > INFO: Client session timed out, have not heard from server in 38226ms for
> > sessionid 0x23d28e0f584005d, closing socket connection and attempting
> > reconnect
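> >
> > (For anyone reproducing this: dumps with the stock JDK tools look roughly
> > like the following -- the PID and file names are placeholders. A full
> > heap dump stops the JVM for its entire duration, which is what pushed the
> > pause past the ZK session timeout:)
> >
> > jstack 12345 > /tmp/r1s2-threads.txt
> > jmap -dump:format=b,file=/tmp/r1s2-heap.hprof 12345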
> >
> > Then, one of the replicas (r3s2) finally stopped trying to sync to r1s2
> > and succeeded in becoming leader:
> >
> > May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
> > INFO: PeerSync: core=browse url=http://r3s2:8080/solr START
> > replicas=[http://r2s2:8080/solr/browse/] nUpdates=100
> > May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
> > INFO: PeerSync: core=browse url=http://r3s2:8080/solr Received 100 versions
> > from r2s2:8080/solr/browse/
> > May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
> > INFO: PeerSync: core=browse url=http://r3s2:8080/solr Our versions are
> > newer. ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730
> > May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
> > INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync succeeded
> >
> > Now that we have a leader, r1s2 can succeed in recovery and finish
> > SolrDispatchFilter.init(), apparently allowing the http-8080 handler to
> > start (r1s2):
> >
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay
> > INFO: No replay needed. core=browse
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> > INFO: Replication Recovery was successful - registering as Active. core=browse
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> > INFO: publishing core=browse state=active
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> > INFO: numShards not found on descriptor - reading it from system property
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> > INFO: Finished recovery process. core=browse
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run
> > INFO: Starting recovery process. core=browse recoveringAfterStartup=false
> > May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader updateClusterState
> > INFO: Updating cloud state from ZooKeeper...
> > May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
> > INFO: user.dir=/
> > May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
> > *INFO: SolrDispatchFilter.init() done*
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> > INFO: publishing core=browse state=recovering
> > May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> > INFO: numShards not found on descriptor - reading it from system property
> > May 28, 2013 5:34:49 PM org.apache.solr.client.solrj.impl.HttpClientUtil createClient
> > INFO: Creating new http client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > May 28, 2013 5:34:49 PM org.apache.catalina.startup.HostConfig deployDirectory
> > INFO: Deploying web application directory ROOT
> > May 28, 2013 5:34:49 PM org.apache.coyote.http11.Http11AprProtocol start
> > *INFO: Starting Coyote HTTP/1.1 on http-8080*   <-- http-8080 started!
> > May 28, 2013 5:34:49 PM org.apache.coyote.ajp.AjpAprProtocol start
> > INFO: Starting Coyote AJP/1.3 on ajp-8009
> > May 28, 2013 5:34:49 PM org.apache.catalina.startup.Catalina start
> > *INFO: Server startup in 668324 ms*   <-- 668324 ms = ~11 minutes to start Catalina
> >
> > Our workaround:
> >
> > * We changed our script to allow 15 seconds before kill -9 (rough sketch
> > below).
> > * Also, we no longer do a restart. We just stop the leader and wait for a
> > new leader. This still results in a few "No registered leader was found"
> > exceptions, but at least the duration is short.
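> >
> > The stop sequence now looks something like this -- illustrative only; the
> > PID file path and timings are placeholders, not our actual script:
> >
> > #!/bin/sh
> > # Stop Tomcat/Solr cleanly, escalating to kill -9 only as a last resort.
> > PID=$(cat /var/run/tomcat/solr.pid)     # placeholder PID file
> > kill "$PID"                             # polite SIGTERM first
> > for i in $(seq 1 15); do                # allow up to 15 seconds to exit
> >   kill -0 "$PID" 2>/dev/null || exit 0  # process gone? clean shutdown, done
> >   sleep 1
> > done
> > kill -9 "$PID"   # unclean kill -- this is what got us into the state above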