Hello - i stumbled upon an issue i've never seen earlier, a shard with all nodes up and running but no leader. This is on 4.10.4. One of the two nodes emits the following error log entry:
2015-03-08 05:25:49,095 WARN [solr.cloud.ElectionContext] - [Thread-136] - : cancelElection did not find election node to remove /overseer_elect/election/93434598784958483-178.21.116.225:8080_solr-n_0000000246 2015-03-08 05:25:49,121 WARN [solr.cloud.ElectionContext] - [Thread-136] - : cancelElection did not find election node to remove /collections/oi/leader_elect/shard3/election/93434598784958483-178.21.116.225:8080_solr_oi_h-n_0000000043 2015-03-08 05:25:49,220 ERROR [solr.update.UpdateLog] - [Thread-136] - : Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001394 refcount=2} java.nio.channels.ClosedChannelException at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99) at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679) at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784) at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89) at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125) at java.io.InputStream.read(InputStream.java:101) at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218) at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800) at org.apache.solr.cloud.ZkController.register(ZkController.java:841) at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277) at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166) 2015-03-08 05:25:49,225 ERROR [solr.update.UpdateLog] - [Thread-136] - : Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001471 refcount=2} java.nio.channels.ClosedChannelException at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99) at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679) at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784) at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89) at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125) at java.io.InputStream.read(InputStream.java:101) at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218) at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800) at org.apache.solr.cloud.ZkController.register(ZkController.java:841) at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277) at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166) 2015-03-08 12:21:04,438 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-28] - : Stopping recovery for core=oi_h coreNodeName=178.21.116.225:8080_solr_oi_h The other node makes a mess in the logs: 2015-03-08 05:25:46,020 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-20] - : Stopping recovery for core=oi_c coreNodeName=194.145.201.190: 8080_solr_oi_c 2015-03-08 05:26:08,670 ERROR [solr.cloud.ShardLeaderElectionContext] - [zkCallback-2-thread-19] - : There was a problem trying to register as the leader:org. apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:146) at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:317) at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163) at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125) at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55) at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358) at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:210) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /collections/openi ndex/leaders/shard3 at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:40) at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:134) ... 11 more Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /collections/oi/leaders/shard3 at org.apache.zookeeper.KeeperException.create(KeeperException.java:119) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:463) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:460) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:417) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:404) at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:138) at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:34) ... 12 more 2015-03-08 05:26:08,682 WARN [solr.cloud.ElectionContext] - [zkCallback-2-thread-19] - : cancelElection did not find election node to remove /collections/open index/leader_elect/shard3/election/93434598784958480-194.145.201.190:8080_solr_oi_c-n_0000000044 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover. core=oi_c:org.apache.solr.common.Solr Exception: No registered leader was found after waiting for 4000ms , collection: oi slice: shard3 at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568) at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235) ... and then a continuous mess like below 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (0) core=oi_c 2015-03-08 05:26:26,703 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover. core=oi_c:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: oi slice: shard3 at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568) at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235) I haven't seen this before, the cluster was unavailable for indexing, search still worked as expected. Is there an open issue i can't seem to find now? Markus