Interesting bug. First there is the already-closed transaction log; that by itself deserves a look. I'm not even positive we should be replaying the log when reconnecting from a ZK disconnect, but even if we do, this should never happen.
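For context on that trace: a java.nio.channels.ClosedChannelException just means the tlog's FileChannel was already closed by the time recoverFromLog tried to read from it. A minimal, self-contained illustration of that failure mode (this is not Solr's actual code; the file path is made up):

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class ClosedChannelDemo {
      public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a tlog file, just for illustration.
        RandomAccessFile raf = new RandomAccessFile("/tmp/tlog.demo", "rw");
        FileChannel channel = raf.getChannel();
        channel.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));

        // Something closes the channel out from under the reader...
        channel.close();

        // ...and any later read fails exactly like the tlog inspection above.
        try {
          channel.read(ByteBuffer.allocate(3), 0);
        } catch (java.nio.channels.ClosedChannelException e) {
          System.out.println("read after close -> " + e);
        }
      }
    }

The point of the sketch is only that any read after close fails this way, regardless of what is in the file, so the question is what closed the tlog before (or while) the replay ran.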
Beyond that, there seems to be some race. Because of the log trouble, we try and cancel the election - but we don't find the ephemeral election node yet for some reason, so we just assume it's fine and that there is no node there to remove (well, we WARN, because it is a little unexpected). Then that ephemeral node materializes, I guess, and the new leader can't register because the old leader won't give up the throne. We don't try and force in the new leader, because that may just hide bugs and cause data loss, so no leader is elected. I'd guess there are two JIRA issues to resolve here. A rough sketch of the election race is included below the quoted logs.

- Mark

On Sun, Mar 8, 2015 at 8:37 AM Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello - I stumbled upon an issue I've never seen before: a shard with all
> nodes up and running but no leader. This is on 4.10.4. One of the two nodes
> emits the following error log entries:
>
> 2015-03-08 05:25:49,095 WARN [solr.cloud.ElectionContext] - [Thread-136] - : cancelElection did not find election node to remove /overseer_elect/election/93434598784958483-178.21.116.225:8080_solr-n_0000000246
> 2015-03-08 05:25:49,121 WARN [solr.cloud.ElectionContext] - [Thread-136] - : cancelElection did not find election node to remove /collections/oi/leader_elect/shard3/election/93434598784958483-178.21.116.225:8080_solr_oi_h-n_0000000043
> 2015-03-08 05:25:49,220 ERROR [solr.update.UpdateLog] - [Thread-136] - : Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001394 refcount=2}
> java.nio.channels.ClosedChannelException
>         at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
>         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
>         at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
>         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
>         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
>         at java.io.InputStream.read(InputStream.java:101)
>         at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
>         at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:841)
>         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277)
>         at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> 2015-03-08 05:25:49,225 ERROR [solr.update.UpdateLog] - [Thread-136] - : Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001471 refcount=2}
> java.nio.channels.ClosedChannelException
>         at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
>         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
>         at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
>         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
>         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
>         at java.io.InputStream.read(InputStream.java:101)
>         at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
>         at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:841)
>         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277)
>         at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> 2015-03-08 12:21:04,438 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-28] - : Stopping recovery for core=oi_h coreNodeName=178.21.116.225:8080_solr_oi_h
>
> The other node makes a mess in the logs:
>
> 2015-03-08 05:25:46,020 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-20] - : Stopping recovery for core=oi_c coreNodeName=194.145.201.190:8080_solr_oi_c
> 2015-03-08 05:26:08,670 ERROR [solr.cloud.ShardLeaderElectionContext] - [zkCallback-2-thread-19] - : There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
>         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:146)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:317)
>         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>         at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
>         at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
>         at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:210)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /collections/openindex/leaders/shard3
>         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:40)
>         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:134)
>         ... 11 more
> Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /collections/oi/leaders/shard3
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>         at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:463)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
>         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:460)
>         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:417)
>         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:404)
>         at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:138)
>         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:34)
>         ... 12 more
>
> 2015-03-08 05:26:08,682 WARN [solr.cloud.ElectionContext] - [zkCallback-2-thread-19] - : cancelElection did not find election node to remove /collections/openindex/leader_elect/shard3/election/93434598784958480-194.145.201.190:8080_solr_oi_c-n_0000000044
> 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover. core=oi_c:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: oi slice: shard3
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
>
> ... and then a continuous mess like below:
>
> 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (0) core=oi_c
> 2015-03-08 05:26:26,703 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover. core=oi_c:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: oi slice: shard3
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
>
> I haven't seen this before; the cluster was unavailable for indexing, but
> search still worked as expected. Is there an open issue I can't seem to
> find now?
>
> Markus
>
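For anyone who wants to see the shape of the race described above, here is a rough sketch against a bare ZooKeeper client. This is not SolrZkClient or the actual ElectionContext code; the connect string and leader path are assumptions, and the parent znodes are assumed to already exist.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LeaderElectRaceSketch {
      public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000,
            event -> connected.countDown());
        connected.await();

        // Hypothetical leader path, loosely modeled on the one in the logs.
        String leaderPath = "/collections/oi/leaders/shard3";

        // cancelElection-style check-then-act: if we don't see the node,
        // assume there is nothing to remove and only WARN.
        if (zk.exists(leaderPath, false) == null) {
          System.out.println("WARN: did not find election node to remove " + leaderPath);
        } else {
          zk.delete(leaderPath, -1);
        }

        // If the old leader's ephemeral node still exists (or shows up after
        // the check above), the new leader's registration fails.
        try {
          zk.create(leaderPath, new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          System.out.println("registered as leader at " + leaderPath);
        } catch (KeeperException.NodeExistsException e) {
          System.out.println("could not register as leader: " + e);
        }

        zk.close();
      }
    }

If another session's ephemeral node for that path exists when create() runs, it fails with NodeExists just like the registration in the quoted logs, and since nothing force-removes the stale node, the shard is left with no leader.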