Hello all,
We have been experiencing some major Solr issues. Our setup is Solr 5.2.1 with 10 shards, each with a replica (20 nodes in total), and three external ZooKeeper 3.4.6 servers.

Node 19 went down, and a short while after that all of our nodes were wiped out. The cloud diagram, live_nodes and clusterstate.json all showed different nodes as being down/active, and the picture changed each time we refreshed. Looking at the logs across all the nodes there were ZooKeeper errors, but nothing out of the ordinary in the ZooKeeper logs themselves.
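For anyone trying to picture what we were seeing with the cloud diagram and live_nodes: the rough SolrJ sketch below prints live_nodes and the per-replica states as ZooKeeper itself reports them, which is what the admin UI should be reflecting (the zkHost string and the collection name "XXX" are placeholders for our real values).

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class ClusterStateCheck {
    public static void main(String[] args) throws Exception {
        // zkHost and the collection name "XXX" are placeholders
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
            client.connect();
            ZkStateReader reader = client.getZkStateReader();
            ClusterState state = reader.getClusterState();

            // Nodes ZooKeeper currently considers live
            System.out.println("live_nodes: " + state.getLiveNodes());

            // Per-replica state for the collection, plus whether its node is live
            // (assumes the collection exists)
            for (Slice slice : state.getSlices("XXX")) {
                for (Replica replica : slice.getReplicas()) {
                    System.out.println(slice.getName() + " " + replica.getName()
                            + " state=" + replica.getStr(ZkStateReader.STATE_PROP)
                            + " live=" + state.liveNodesContain(replica.getNodeName()));
                }
            }
        }
    }
}

This reads the same znodes (live_nodes and clusterstate.json) that the cloud diagram is built from.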
A couple of examples from the Solr logs:

WARN  - 2015-10-22 09:39:48.536; [   ] org.apache.solr.cloud.ZkController$4; listener throws error
org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/XXX/params.json
    at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:163)
    at org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:926)
    at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2580)
    at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2376)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/XXX/params.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
    at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
    at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:153)
    ... 3 more

ERROR - 2015-10-26 11:28:13.141; [XXX shard6 ] org.apache.solr.common.SolrException; There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
    at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:154)
    at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:330)
    at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:198)
    at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:159)
    at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:348)
    at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:1075)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:888)
    at org.apache.solr.cloud.ZkController$RegisterCoreAsync.call(ZkController.java:226)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

ERROR - 2015-10-26 11:29:45.757; [XXX shard6 ] org.apache.solr.common.SolrException; Error while trying to recover. core=XXX:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue/qn-
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:380)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:377)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:377)
    at org.apache.solr.cloud.DistributedQueue.createData(DistributedQueue.java:380)
    at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:364)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1219)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1129)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1125)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:348)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:229)

WARN  - 2015-10-26 11:32:03.116; [ XXX] org.apache.solr.cloud.ZkController$4; listener throws error
org.apache.solr.common.SolrException: Unable to reload core [XXX]
    at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:743)
    at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2586)
    at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2376)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:815)
    at org.apache.solr.core.SolrCore.reload(SolrCore.java:455)
    at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:738)
    ... 2 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1657)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1769)
    at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:911)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:788)
    ... 4 more
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:718)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:732)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:396)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:41)
    at org.apache.solr.core.SolrCore$2.call(SolrCore.java:904)
    at org.apache.solr.core.SolrCore$2.call(SolrCore.java:899)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1622)
    ... 7 more
Caused by: java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown Source)
    at sun.nio.ch.FileChannelImpl.write(Unknown Source)
    at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
    at java.nio.channels.Channels.writeFully(Unknown Source)
    at java.nio.channels.Channels.access$000(Unknown Source)
    at java.nio.channels.Channels$1.write(Unknown Source)
    at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:268)
    at java.util.zip.CheckedOutputStream.write(Unknown Source)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    at org.apache.lucene.store.OutputStreamIndexOutput.getChecksum(OutputStreamIndexOutput.java:80)
    at org.apache.lucene.codecs.CodecUtil.writeFooter(CodecUtil.java:310)
    at org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.close(Lucene50DocValuesConsumer.java:571)
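As far as we understand it, a SessionExpiredException means a Solr node failed to heartbeat ZooKeeper within the session timeout (zkClientTimeout on the Solr side), so ZooKeeper expired the session and removed that node's ephemeral znodes, which would explain the leader-election and recovery errors that follow. If this happens again we would like to line the expirations up against GC pauses and load on each node, so something like the small standalone watcher below, run next to Solr, is what we have in mind: it simply logs its own ZooKeeper connection state changes with timestamps (the connect string and the 15000 ms session timeout are placeholders, and it only monitors its own session, not Solr's).

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionWatcher {
    public static void main(String[] args) throws Exception {
        // Connect string and session timeout are placeholders; this client's
        // timeout is independent of Solr's zkClientTimeout.
        CountDownLatch expired = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Timestamped state changes (SyncConnected, Disconnected, Expired)
                // can be lined up against the Solr and GC logs afterwards.
                System.out.println(System.currentTimeMillis() + " " + event.getState());
                if (event.getState() == Event.KeeperState.Expired) {
                    expired.countDown();
                }
            }
        });
        expired.await(); // keep running until this client's own session expires
        zk.close();
    }
}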
After restarting most of the nodes, we got to the point where only node 19 (the first node that went down) was still in recovery. However, Solr was still not working: most nodes were logging a lot of read timeout errors:

ERROR - 2015-10-26 12:09:05.621; [XXX shard8 Node18 XXX] org.apache.solr.update.StreamingSolrClients$1; error
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:243)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
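From the stack trace these timeouts appear to be on the connections Solr itself uses to forward updates between nodes (ConcurrentUpdateSolrClient), rather than on requests from our own clients. Because the cloud diagram was not trustworthy at this point, a quick way to see whether a node is actually answering is to ping each core directly, roughly like this (the base URLs are placeholders for our nodes, and it assumes the default /admin/ping handler is configured):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.SolrPingResponse;

public class PingNodes {
    public static void main(String[] args) {
        // Placeholder core URLs; in practice this would loop over all 20 nodes.
        String[] coreUrls = {
                "http://node09:8983/solr/XXX",
                "http://node19:8983/solr/XXX"
        };
        for (String url : coreUrls) {
            try (HttpSolrClient client = new HttpSolrClient(url)) {
                SolrPingResponse rsp = client.ping();
                System.out.println(url + " -> status " + rsp.getStatus()
                        + " in " + rsp.getElapsedTime() + " ms");
            } catch (Exception e) {
                // A node that is listed as "active" but not answering shows up here
                System.out.println(url + " -> " + e);
            }
        }
    }
}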
Node 9 was listed as active on the cloud diagram and in the live_nodes list, and was receiving no errors at all, while node 19 had two recurring errors:

WARN  - 2015-10-26 12:26:25.812; [   ] org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for core=XXX coreNodeName=Node19

ERROR - 2015-10-26 12:26:29.807; [   ] org.apache.solr.common.SolrException; Error while trying to recover. core=XXX:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: XXX slice: shard9
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:684)
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:670)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:325)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:229)

Again, there was nothing out of the ordinary in the ZooKeeper logs. Restarting node 9 fixed the issue.

My questions are:

* This seems to be a ZooKeeper problem. Could a session timeout on one node cause all of the other nodes to fail?
* Why were no issues flagged in the ZooKeeper logs?
* After restarting the nodes, they experienced read timeout errors - what are these errors?
* Why was node 9 listed as active when it wasn't - could this be related to the ZooKeeper issue?

Any help would be appreciated. I'm at a loss as to what happened, and I really don't want this to happen again.

Kind regards,
Philippa