Hello all,
We have been experiencing some major Solr issues. Our setup is Solr 5.2.1 with 10 shards, each with a replica (20 nodes in total), and three external ZooKeeper 3.4.6 servers.

Node 19 went down, and a short while after that all of our nodes were wiped out. The cloud diagram, live_nodes and clusterstate.json all showed different nodes as being down/active, and the picture changed each time we refreshed. Looking at the logs across all the nodes there were ZooKeeper errors, but nothing out of the ordinary in the ZooKeeper logs themselves.
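For anyone trying to picture what we were seeing with the cloud diagram and live_nodes: the rough SolrJ sketch below prints live_nodes and the per-replica states as ZooKeeper itself reports them, which is what the admin UI should be reflecting (the zkHost string and the collection name "XXX" are placeholders for our real values).

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class ClusterStateCheck {
    public static void main(String[] args) throws Exception {
        // zkHost and the collection name "XXX" are placeholders
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
            client.connect();
            ZkStateReader reader = client.getZkStateReader();
            ClusterState state = reader.getClusterState();

            // Nodes ZooKeeper currently considers live
            System.out.println("live_nodes: " + state.getLiveNodes());

            // Per-replica state for the collection, plus whether its node is live
            // (assumes the collection exists)
            for (Slice slice : state.getSlices("XXX")) {
                for (Replica replica : slice.getReplicas()) {
                    System.out.println(slice.getName() + " " + replica.getName()
                            + " state=" + replica.getStr(ZkStateReader.STATE_PROP)
                            + " live=" + state.liveNodesContain(replica.getNodeName()));
                }
            }
        }
    }
}

This reads the same znodes (live_nodes and clusterstate.json) that the cloud diagram is built from.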
A couple of examples from the Solr logs:

WARN  - 2015-10-22 09:39:48.536; [   ] org.apache.solr.cloud.ZkController$4; listener throws error
org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/XXX/params.json
    at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:163)
    at org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:926)
    at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2580)
    at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2376)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/XXX/params.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
    at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
    at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:153)
    ... 3 more

ERROR - 2015-10-26 11:28:13.141; [XXX shard6 ] org.apache.solr.common.SolrException; There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
    at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:154)
    at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:330)
    at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:198)
    at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:159)
    at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:348)
    at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:1075)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:888)
    at org.apache.solr.cloud.ZkController$RegisterCoreAsync.call(ZkController.java:226)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

ERROR - 2015-10-26 11:29:45.757; [XXX shard6 ] org.apache.solr.common.SolrException; Error while trying to recover. core=XXX:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue/qn-
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:380)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:377)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:377)
    at org.apache.solr.cloud.DistributedQueue.createData(DistributedQueue.java:380)
    at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:364)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1219)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1129)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1125)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:348)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:229)

WARN  - 2015-10-26 11:32:03.116; [ XXX] org.apache.solr.cloud.ZkController$4; listener throws error
org.apache.solr.common.SolrException: Unable to reload core [XXX]
    at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:743)
    at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2586)
    at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2376)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:815)
    at org.apache.solr.core.SolrCore.reload(SolrCore.java:455)
    at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:738)
    ... 2 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1657)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1769)
    at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:911)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:788)
    ... 4 more
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:718)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:732)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:396)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:41)
    at org.apache.solr.core.SolrCore$2.call(SolrCore.java:904)
    at org.apache.solr.core.SolrCore$2.call(SolrCore.java:899)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1622)
    ... 7 more
Caused by: java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown Source)
    at sun.nio.ch.FileChannelImpl.write(Unknown Source)
    at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
    at java.nio.channels.Channels.writeFully(Unknown Source)
    at java.nio.channels.Channels.access$000(Unknown Source)
    at java.nio.channels.Channels$1.write(Unknown Source)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:268)
    at java.util.zip.CheckedOutputStream.write(Unknown Source)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    at org.apache.lucene.store.OutputStreamIndexOutput.getChecksum(OutputStreamIndexOutput.java:80)
    at org.apache.lucene.codecs.CodecUtil.writeFooter(CodecUtil.java:310)
    at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.close(Lucene50DocValuesConsumer.java:571)
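As far as we understand it, a SessionExpiredException means a Solr node failed to heartbeat ZooKeeper within the session timeout (zkClientTimeout on the Solr side), so ZooKeeper expired the session and removed that node's ephemeral znodes, which would explain the leader-election and recovery errors that follow. If this happens again we would like to line the expirations up against GC pauses and load on each node, so something like the small standalone watcher below, run next to Solr, is what we have in mind: it simply logs its own ZooKeeper connection state changes with timestamps (the connect string and the 15000 ms session timeout are placeholders, and it only monitors its own session, not Solr's).

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionWatcher {
    public static void main(String[] args) throws Exception {
        // Connect string and session timeout are placeholders; this client's
        // timeout is independent of Solr's zkClientTimeout.
        CountDownLatch expired = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Timestamped state changes (SyncConnected, Disconnected, Expired)
                // can be lined up against the Solr and GC logs afterwards.
                System.out.println(System.currentTimeMillis() + " " + event.getState());
                if (event.getState() == Event.KeeperState.Expired) {
                    expired.countDown();
                }
            }
        });
        expired.await(); // keep running until this client's own session expires
        zk.close();
    }
}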
After restarting most of the nodes, we got to the point where only node 19 (the first node that went down) was still in recovery. However, Solr was still not working: most nodes were logging a lot of read timeout errors:

ERROR - 2015-10-26 12:09:05.621; [XXX shard8 Node18 XXX] org.apache.solr.update.StreamingSolrClients$1; error
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:243)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
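From the stack trace these timeouts appear to be on the connections Solr itself uses to forward updates between nodes (ConcurrentUpdateSolrClient), rather than on requests from our own clients. Because the cloud diagram was not trustworthy at this point, a quick way to see whether a node is actually answering is to ping each core directly, roughly like this (the base URLs are placeholders for our nodes, and it assumes the default /admin/ping handler is configured):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.SolrPingResponse;

public class PingNodes {
    public static void main(String[] args) {
        // Placeholder core URLs; in practice this would loop over all 20 nodes.
        String[] coreUrls = {
                "http://node09:8983/solr/XXX",
                "http://node19:8983/solr/XXX"
        };
        for (String url : coreUrls) {
            try (HttpSolrClient client = new HttpSolrClient(url)) {
                SolrPingResponse rsp = client.ping();
                System.out.println(url + " -> status " + rsp.getStatus()
                        + " in " + rsp.getElapsedTime() + " ms");
            } catch (Exception e) {
                // A node that is listed as "active" but not answering shows up here
                System.out.println(url + " -> " + e);
            }
        }
    }
}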
Node 9 was listed as active on the cloud diagram and in the live_nodes list, and was receiving no errors at all, while node 19 had two recurring errors:

WARN  - 2015-10-26 12:26:25.812; [   ] org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for core=XXX coreNodeName=Node19

ERROR - 2015-10-26 12:26:29.807; [   ] org.apache.solr.common.SolrException; Error while trying to recover. core=XXX:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: XXX slice: shard9
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:684)
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:670)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:325)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:229)

Again, there was nothing out of the ordinary in the ZooKeeper logs. Restarting node 9 fixed the issue.

My questions are:

* This seems to be a ZooKeeper problem. Could a session timeout on one node cause all of the other nodes to fail?
* Why were no issues flagged in the ZooKeeper logs?
* After restarting the nodes, they experienced read timeout errors - what are these errors?
* Why was node 9 listed as active when it wasn't - could this be related to the ZooKeeper issue?

Any help would be appreciated. I'm at a loss as to what happened, and I really don't want this to happen again.

Kind regards,
Philippa