Interesting bug. First there is the already-closed transaction log; that by itself deserves a look. I'm not even positive we should be replaying the log when reconnecting from a ZK disconnect, but even if we do, this should never happen.
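For context on that trace: a java.nio.channels.ClosedChannelException just means the tlog's FileChannel was already closed by the time recoverFromLog tried to read from it. A minimal, self-contained illustration of that failure mode (this is not Solr's actual code; the file path is made up):

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class ClosedChannelDemo {
      public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a tlog file, just for illustration.
        RandomAccessFile raf = new RandomAccessFile("/tmp/tlog.demo", "rw");
        FileChannel channel = raf.getChannel();
        channel.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));

        // Something closes the channel out from under the reader...
        channel.close();

        // ...and any later read fails exactly like the tlog inspection above.
        try {
          channel.read(ByteBuffer.allocate(3), 0);
        } catch (java.nio.channels.ClosedChannelException e) {
          System.out.println("read after close -> " + e);
        }
      }
    }

The point of the sketch is only that any read after close fails this way, regardless of what is in the file, so the question is what closed the tlog before (or while) the replay ran.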
Beyond that, there seems to be some race. Because of the log trouble, we try and cancel the election - but we don't find the ephemeral election node yet for some reason, so we just assume it's fine and that there is no node there to remove (well, we WARN, because it is a little unexpected). Then that ephemeral node materializes, I guess, and the new leader can't register because the old leader won't give up the throne. We don't try and force in the new leader, because that may just hide bugs and cause data loss, so no leader is elected. I'd guess there are two JIRA issues to resolve here. A rough sketch of the election race is included below the quoted logs.

- Mark

On Sun, Mar 8, 2015 at 8:37 AM Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello - I stumbled upon an issue I've never seen before: a shard with all
> nodes up and running but no leader. This is on 4.10.4. One of the two nodes
> emits the following error log entries:
>
> 2015-03-08 05:25:49,095 WARN [solr.cloud.ElectionContext] - [Thread-136] - : cancelElection did not find election node to remove /overseer_elect/election/93434598784958483-178.21.116.225:8080_solr-n_0000000246
> 2015-03-08 05:25:49,121 WARN [solr.cloud.ElectionContext] - [Thread-136] - : cancelElection did not find election node to remove /collections/oi/leader_elect/shard3/election/93434598784958483-178.21.116.225:8080_solr_oi_h-n_0000000043
> 2015-03-08 05:25:49,220 ERROR [solr.update.UpdateLog] - [Thread-136] - : Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001394 refcount=2}
> java.nio.channels.ClosedChannelException
>         at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
>         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
>         at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
>         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
>         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
>         at java.io.InputStream.read(InputStream.java:101)
>         at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
>         at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:841)
>         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277)
>         at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> 2015-03-08 05:25:49,225 ERROR [solr.update.UpdateLog] - [Thread-136] - : Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001471 refcount=2}
> java.nio.channels.ClosedChannelException
>         at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
>         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
>         at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
>         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
>         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
>         at java.io.InputStream.read(InputStream.java:101)
>         at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
>         at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:841)
>         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277)
>         at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> 2015-03-08 12:21:04,438 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-28] - : Stopping recovery for core=oi_h coreNodeName=178.21.116.225:8080_solr_oi_h
>
> The other node makes a mess in the logs:
>
> 2015-03-08 05:25:46,020 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-20] - : Stopping recovery for core=oi_c coreNodeName=194.145.201.190:8080_solr_oi_c
> 2015-03-08 05:26:08,670 ERROR [solr.cloud.ShardLeaderElectionContext] - [zkCallback-2-thread-19] - : There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
>         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:146)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:317)
>         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>         at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
>         at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
>         at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:210)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /collections/openindex/leaders/shard3
>         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:40)
>         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:134)
>         ... 11 more
> Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /collections/oi/leaders/shard3
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>         at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:463)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
>         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:460)
>         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:417)
>         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:404)
>         at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:138)
>         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:34)
>         ... 12 more
>
> 2015-03-08 05:26:08,682 WARN [solr.cloud.ElectionContext] - [zkCallback-2-thread-19] - : cancelElection did not find election node to remove /collections/openindex/leader_elect/shard3/election/93434598784958480-194.145.201.190:8080_solr_oi_c-n_0000000044
> 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover. core=oi_c:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: oi slice: shard3
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
>
> ... and then a continuous mess like below:
>
> 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (0) core=oi_c
> 2015-03-08 05:26:26,703 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover. core=oi_c:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: oi slice: shard3
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
>
> I haven't seen this before; the cluster was unavailable for indexing, but
> search still worked as expected. Is there an open issue I can't seem to
> find now?
>
> Markus
>
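For anyone who wants to see the shape of the race described above, here is a rough sketch against a bare ZooKeeper client. This is not SolrZkClient or the actual ElectionContext code; the connect string and leader path are assumptions, and the parent znodes are assumed to already exist.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LeaderElectRaceSketch {
      public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000,
            event -> connected.countDown());
        connected.await();

        // Hypothetical leader path, loosely modeled on the one in the logs.
        String leaderPath = "/collections/oi/leaders/shard3";

        // cancelElection-style check-then-act: if we don't see the node,
        // assume there is nothing to remove and only WARN.
        if (zk.exists(leaderPath, false) == null) {
          System.out.println("WARN: did not find election node to remove " + leaderPath);
        } else {
          zk.delete(leaderPath, -1);
        }

        // If the old leader's ephemeral node still exists (or shows up after
        // the check above), the new leader's registration fails.
        try {
          zk.create(leaderPath, new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          System.out.println("registered as leader at " + leaderPath);
        } catch (KeeperException.NodeExistsException e) {
          System.out.println("could not register as leader: " + e);
        }

        zk.close();
      }
    }

If another session's ephemeral node for that path exists when create() runs, it fails with NodeExists just like the registration in the quoted logs, and since nothing force-removes the stale node, the shard is left with no leader.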