The update log replay issue looks like https://issues.apache.org/jira/browse/SOLR-6583
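For context: the ClosedChannelException in the quoted logs further down is simply a read against a FileChannel that was closed before UpdateLog.recoverFromLog got around to inspecting the tlog. Here is a minimal, standalone sketch of that failure mode; it is illustrative only, not Solr code, and the file path is just a placeholder for an existing tlog file.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Standalone illustration of the failure mode in the tlog traces below:
// reading from a FileChannel that was closed earlier throws
// ClosedChannelException. Not Solr code; the file path is a placeholder.
public class ClosedTlogRead {

    public static void main(String[] args) throws IOException {
        FileChannel channel = FileChannel.open(
                Paths.get("/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001394"),
                StandardOpenOption.READ);

        channel.close(); // simulate the tlog channel being closed on ZK disconnect

        ByteBuffer buf = ByteBuffer.allocate(8);
        try {
            channel.read(buf); // same low-level call that fails in FileChannelImpl.read
        } catch (ClosedChannelException e) {
            // UpdateLog.recoverFromLog hits exactly this when it inspects a log
            // whose channel is already gone; a defensive caller would check
            // channel.isOpen() or skip the replay entirely in that state.
            System.err.println("tlog channel already closed: " + e);
        }
    }
}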
On 9 Mar 2015 01:41, "Mark Miller" <markrmil...@gmail.com> wrote:

> Interesting bug.
>
> First there is the already closed transaction log. That by itself deserves
> a look. I'm not even positive we should be replaying the log when
> reconnecting from a ZK disconnect, but even if we do, this should never
> happen.
>
> Beyond that there seems to be some race. Because of the log trouble, we try
> and cancel the election - but we don't find the ephemeral election node yet
> for some reason and so just assume it's fine, no node there to remove
> (well, we WARN, because it is a little unexpected). Then that ephemeral
> node materializes, I guess, and the new leader doesn't register because the
> old leader won't give up the throne. We don't try and force the new leader
> because that may just hide bugs and cause data loss, so no leader is
> elected.
>
> I'd guess there are two JIRA issues to resolve here.
>
> - Mark
>
> On Sun, Mar 8, 2015 at 8:37 AM Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> > Hello - I stumbled upon an issue I've never seen before: a shard with all
> > nodes up and running but no leader. This is on 4.10.4. One of the two
> > nodes emits the following error log entries:
> >
> > 2015-03-08 05:25:49,095 WARN [solr.cloud.ElectionContext] - [Thread-136] - :
> > cancelElection did not find election node to remove
> > /overseer_elect/election/93434598784958483-178.21.116.225:8080_solr-n_0000000246
> > 2015-03-08 05:25:49,121 WARN [solr.cloud.ElectionContext] - [Thread-136] - :
> > cancelElection did not find election node to remove
> > /collections/oi/leader_elect/shard3/election/93434598784958483-178.21.116.225:8080_solr_oi_h-n_0000000043
> > 2015-03-08 05:25:49,220 ERROR [solr.update.UpdateLog] - [Thread-136] - :
> > Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001394 refcount=2}
> > java.nio.channels.ClosedChannelException
> >         at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
> >         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
> >         at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
> >         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
> >         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
> >         at java.io.InputStream.read(InputStream.java:101)
> >         at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
> >         at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
> >         at org.apache.solr.cloud.ZkController.register(ZkController.java:841)
> >         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277)
> >         at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> > 2015-03-08 05:25:49,225 ERROR [solr.update.UpdateLog] - [Thread-136] - :
> > Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001471 refcount=2}
> > java.nio.channels.ClosedChannelException
> >         at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
> >         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
> >         at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
> >         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
> >         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
> >         at java.io.InputStream.read(InputStream.java:101)
> >         at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
> >         at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
> >         at org.apache.solr.cloud.ZkController.register(ZkController.java:841)
> >         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277)
> >         at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> > 2015-03-08 12:21:04,438 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-28] - :
> > Stopping recovery for core=oi_h coreNodeName=178.21.116.225:8080_solr_oi_h
> >
> > The other node makes a mess in the logs:
> >
> > 2015-03-08 05:25:46,020 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-20] - :
> > Stopping recovery for core=oi_c coreNodeName=194.145.201.190:8080_solr_oi_c
> > 2015-03-08 05:26:08,670 ERROR [solr.cloud.ShardLeaderElectionContext] - [zkCallback-2-thread-19] - :
> > There was a problem trying to register as the leader:org.apache.solr.common.SolrException:
> > Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
> >         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:146)
> >         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:317)
> >         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
> >         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
> >         at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
> >         at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
> >         at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:210)
> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > Caused by: org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$NodeExistsException:
> > KeeperErrorCode = NodeExists for /collections/openindex/leaders/shard3
> >         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:40)
> >         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:134)
> >         ... 11 more
> > Caused by: org.apache.zookeeper.KeeperException$NodeExistsException:
> > KeeperErrorCode = NodeExists for /collections/oi/leaders/shard3
> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
> >         at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:463)
> >         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
> >         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:460)
> >         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:417)
> >         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:404)
> >         at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:138)
> >         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:34)
> >         ... 12 more
> >
> > 2015-03-08 05:26:08,682 WARN [solr.cloud.ElectionContext] - [zkCallback-2-thread-19] - :
> > cancelElection did not find election node to remove
> > /collections/openindex/leader_elect/shard3/election/93434598784958480-194.145.201.190:8080_solr_oi_c-n_0000000044
> > 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - :
> > Error while trying to recover. core=oi_c:org.apache.solr.common.SolrException:
> > No registered leader was found after waiting for 4000ms , collection: oi slice: shard3
> >         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
> >         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
> >         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
> >         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
> >
> > ... and then a continuous mess like below:
> >
> > 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - :
> > Recovery failed - trying again... (0) core=oi_c
> > 2015-03-08 05:26:26,703 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - :
> > Error while trying to recover. core=oi_c:org.apache.solr.common.SolrException:
> > No registered leader was found after waiting for 4000ms , collection: oi slice: shard3
> >         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
> >         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
> >         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
> >         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
> >
> > I haven't seen this before; the cluster was unavailable for indexing,
> > search still worked as expected. Is there an open issue I can't seem to
> > find now?
> >
> > Markus
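As a footnote to the race Mark describes above: the leader-registration half of it can be illustrated against a bare ZooKeeper ensemble. The following is a hypothetical sketch using the plain ZooKeeper client API, not Solr's ElectionContext code; the connect string, znode path, and payload are placeholders, and the parent znodes are assumed to already exist.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of the race described above: the would-be leader tries
// to create the shard's leader znode, but the previous leader's node is still
// there, so registration fails with NodeExistsException and nobody forces it
// away. Not Solr code; connect string, path, and payload are placeholders.
public class LeaderNodeRace {

    private static final String LEADER_PATH = "/collections/oi/leaders/shard3";

    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await(); // wait for the session before issuing requests

        try {
            // Claim the shard by creating an ephemeral leader node. If the old
            // leader's node is still present (its session has not expired or it
            // has not cleaned up), this fails just like the makePath call in
            // ShardLeaderElectionContextBase did in the logs above.
            zk.create(LEADER_PATH, "leader props".getBytes("UTF-8"),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("registered as leader for shard3");
        } catch (KeeperException.NodeExistsException e) {
            // The old leader won't give up the throne; forcing the node away
            // could hide bugs or lose data, so the shard simply has no leader
            // until the stale node disappears.
            System.out.println("leader znode already exists: " + e.getPath());
        } finally {
            zk.close();
        }
    }
}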