Thanks - I can try and look into this perhaps next week. You might copy the details into a JIRA issue to prevent it from getting lost though...
- Mark

On Jun 3, 2013, at 4:46 PM, John Guerrero <jguerr...@tagged.com> wrote:

> Solr 4.2.1, Tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP), Java 6u27 64-bit.
> 6 nodes, 2 shards, 3 replicas each. Names changed to r1s2 (replica 1, shard 2), r2s2, and r3s2 for the replicas in shard 2.
>
> What we see:
>
> * Under production load, we restart a leader (r1s2) and observe in the cloud admin that the old leader is in state "Down" and no new leader is ever elected.
> * The system stays like this until we stop the old leader (or cause a ZK timeout... see below).
>
> *Please note:* the leader is killed, then kill -9'd 5 seconds later, before restarting. We have since changed this.
>
> Digging into the logs on the old leader (r1s2):
>
> * The old leader restarted at 5:23:29 PM, but appears to be stuck in SolrDispatchFilter.init() (see recovery at bottom).
> * It doesn't want to become leader, possibly due to the unclean shutdown:
>
> May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions
> INFO: PeerSync: core=browse url=http://r1s2:8080/solr Our versions are too old. ourHighThreshold=1436325665147191297 otherLowThreshold=1436325775374548992
>
> * It then tries to recover, but cannot, because there is no leader:
>
> May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover. core=browse:org.apache.solr.common.SolrException: No registered leader was found, collection:browse slice:shard2
>
> * Meanwhile, blocking in init() appears to prevent the http-8080 handler from starting (see recovery at bottom).
>
> Digging into the other replicas (r2s2):
>
> * For some reason, the old leader (r1s2) remains in the list of replicas that r2s2 attempts to sync to:
>
> May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync
> INFO: PeerSync: core=browse url=http://r2s2:8080/solr START replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/] nUpdates=100
>
> * This apparently fails (30-second timeout), possibly because the http-8080 handler was never started on r1s2:
>
> May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse
> WARNING: PeerSync: core=browse url=http://r2s2:8080/solr exception talking to http://r1s2:8080/solr/browse/, failed
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://r1s2:8080/solr/browse
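>
> As an aside, our reading of the "Our versions are too old" line on r1s2 is a simple window-overlap check: r1s2's newest update version is older than the oldest version its peer still retains in its update log, so peer sync alone cannot bridge the gap and a full replication recovery is required. A rough sketch of that comparison in Python, using the thresholds from the log line above (our illustration, not Solr's actual code):
>
>     # Illustration of the "versions are too old" decision (not Solr source).
>     # The numbers are the thresholds logged by r1s2 above.
>     OUR_HIGH = 1436325665147191297    # newest update version r1s2 has
>     OTHER_LOW = 1436325775374548992   # oldest version the peer still retains
>
>     def can_peer_sync(our_high, other_low):
>         # Peer sync can only fill the gap if our updates overlap the
>         # peer's retained-update window; otherwise full replication is needed.
>         return our_high >= other_low
>
>     if not can_peer_sync(OUR_HIGH, OTHER_LOW):
>         print("Our versions are too old -> fall back to replication recovery")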
>
> *At this point, the cluster will remain indefinitely without a leader, if nothing else changes.*
>
> But in this particular instance, we took some stack and heap dumps from r1s2, which paused the JVM long enough to cause a *ZooKeeper timeout on the old leader (r1s2)*:
>
> May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run
> INFO: Client session timed out, have not heard from server in 38226ms for sessionid 0x23d28e0f584005d, closing socket connection and attempting reconnect
>
> Then one of the replicas (r3s2) finally stopped trying to sync to r1s2 and succeeded in becoming leader:
>
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr START replicas=[http://r2s2:8080/solr/browse/] nUpdates=100
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr Received 100 versions from r2s2:8080/solr/browse/
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr Our versions are newer. ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730
> May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
> INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync succeeded
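>
> While debugging, we found it more reliable to check ZooKeeper directly for the shard's leader znode than to watch the cloud admin UI. A rough sketch using the Python kazoo client -- the znode path is the layout we see in our 4.2.1 cluster, and the ZK host names are illustrative:
>
>     # Poll ZooKeeper until a leader is registered for a given shard.
>     import json
>     import time
>     from kazoo.client import KazooClient
>
>     def wait_for_leader(zk_hosts, collection="browse", shard="shard2", timeout=300):
>         zk = KazooClient(hosts=zk_hosts)
>         zk.start()
>         try:
>             path = "/collections/%s/leaders/%s" % (collection, shard)
>             deadline = time.time() + timeout
>             while time.time() < deadline:
>                 if zk.exists(path):
>                     data, _stat = zk.get(path)
>                     props = json.loads(data.decode("utf-8"))
>                     return props.get("base_url")   # e.g. http://r3s2:8080/solr
>                 time.sleep(2)
>             raise RuntimeError("No registered leader for %s/%s" % (collection, shard))
>         finally:
>             zk.stop()
>
>     if __name__ == "__main__":
>         print(wait_for_leader("zk1:2181,zk2:2181,zk3:2181"))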
>
> Now that we have a leader, r1s2 can succeed in recovery and finish SolrDispatchFilter.init(), apparently allowing the http-8080 handler to start (r1s2):
>
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay
> INFO: No replay needed. core=browse
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> INFO: Replication Recovery was successful - registering as Active. core=browse
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> INFO: publishing core=browse state=active
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> INFO: numShards not found on descriptor - reading it from system property
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> INFO: Finished recovery process. core=browse
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run
> INFO: Starting recovery process. core=browse recoveringAfterStartup=false
> May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader updateClusterState
> INFO: Updating cloud state from ZooKeeper...
> May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
> INFO: user.dir=/
> May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
> *INFO: SolrDispatchFilter.init() done*
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> INFO: publishing core=browse state=recovering
> May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
> INFO: numShards not found on descriptor - reading it from system property
> May 28, 2013 5:34:49 PM org.apache.solr.client.solrj.impl.HttpClientUtil createClient
> INFO: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> May 28, 2013 5:34:49 PM org.apache.catalina.startup.HostConfig deployDirectory
> INFO: Deploying web application directory ROOT
> May 28, 2013 5:34:49 PM org.apache.coyote.http11.Http11AprProtocol start
> *INFO: Starting Coyote HTTP/1.1 on http-8080    #<-- http-8080 started!*
> May 28, 2013 5:34:49 PM org.apache.coyote.ajp.AjpAprProtocol start
> INFO: Starting Coyote AJP/1.3 on ajp-8009
> May 28, 2013 5:34:49 PM org.apache.catalina.startup.Catalina start
> *INFO: Server startup in 668324 ms    #<-- 668 sec, ~11 minutes to start Catalina*
>
> Our Workaround:
>
> * We changed our script to allow 15 seconds before kill -9 (a rough sketch of the revised stop logic is below).
> * Also, we no longer do a restart. We just stop the leader and wait for a new leader to be elected before starting the old one again. This still results in a few "No registered leader was found" exceptions, but at least the duration is short.
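>
> For what it's worth, the revised stop logic boils down to the following (Python sketch; the PID file location and the init-script path are from our setup and purely illustrative):
>
>     # Ask Tomcat to shut down cleanly, give it up to 15 seconds,
>     # and only fall back to SIGKILL as a last resort.
>     import os
>     import signal
>     import subprocess
>     import time
>
>     PID_FILE = "/var/run/tomcat/solr.pid"    # illustrative path
>     GRACE_SECONDS = 15                       # was 5 -- too short for a clean shutdown
>
>     def pid_alive(pid):
>         try:
>             os.kill(pid, 0)    # signal 0 only checks that the process exists
>             return True
>         except OSError:
>             return False
>
>     def stop_solr():
>         with open(PID_FILE) as f:
>             pid = int(f.read().strip())
>         subprocess.call(["/etc/init.d/tomcat", "stop"])    # graceful shutdown request
>         deadline = time.time() + GRACE_SECONDS
>         while time.time() < deadline:
>             if not pid_alive(pid):
>                 return    # clean exit: ZK session closes properly, leadership can fail over
>             time.sleep(1)
>         os.kill(pid, signal.SIGKILL)    # risks the unclean-shutdown deadlock described above
>
>     if __name__ == "__main__":
>         stop_solr()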