The update log replay issue looks like https://issues.apache.org/jira/browse/SOLR-6583
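For context: the ClosedChannelException in the quoted logs further down is simply a read against a FileChannel that was closed before UpdateLog.recoverFromLog got around to inspecting the tlog. Here is a minimal, standalone sketch of that failure mode; it is illustrative only, not Solr code, and the file path is just a placeholder for an existing tlog file.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Standalone illustration of the failure mode in the tlog traces below:
// reading from a FileChannel that was closed earlier throws
// ClosedChannelException. Not Solr code; the file path is a placeholder.
public class ClosedTlogRead {

    public static void main(String[] args) throws IOException {
        FileChannel channel = FileChannel.open(
                Paths.get("/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001394"),
                StandardOpenOption.READ);

        channel.close(); // simulate the tlog channel being closed on ZK disconnect

        ByteBuffer buf = ByteBuffer.allocate(8);
        try {
            channel.read(buf); // same low-level call that fails in FileChannelImpl.read
        } catch (ClosedChannelException e) {
            // UpdateLog.recoverFromLog hits exactly this when it inspects a log
            // whose channel is already gone; a defensive caller would check
            // channel.isOpen() or skip the replay entirely in that state.
            System.err.println("tlog channel already closed: " + e);
        }
    }
}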
On 9 Mar 2015 01:41, "Mark Miller" <markrmil...@gmail.com> wrote:

> Interesting bug.
>
> First there is the already closed transaction log. That by itself deserves
> a look. I'm not even positive we should be replaying the log when
> reconnecting from a ZK disconnect, but even if we do, this should never
> happen.
>
> Beyond that there seems to be some race. Because of the log trouble, we try
> and cancel the election - but we don't find the ephemeral election node yet
> for some reason and so just assume it's fine, no node there to remove
> (well, we WARN, because it is a little unexpected). Then that ephemeral
> node materializes, I guess, and the new leader doesn't register because the
> old leader won't give up the throne. We don't try and force the new leader
> because that may just hide bugs and cause data loss, so no leader is
> elected.
>
> I'd guess there are two JIRA issues to resolve here.
>
> - Mark
>
> On Sun, Mar 8, 2015 at 8:37 AM Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> > Hello - I stumbled upon an issue I've never seen before: a shard with all
> > nodes up and running but no leader. This is on 4.10.4. One of the two
> > nodes emits the following error log entries:
> >
> > 2015-03-08 05:25:49,095 WARN [solr.cloud.ElectionContext] - [Thread-136] - :
> > cancelElection did not find election node to remove
> > /overseer_elect/election/93434598784958483-178.21.116.225:8080_solr-n_0000000246
> > 2015-03-08 05:25:49,121 WARN [solr.cloud.ElectionContext] - [Thread-136] - :
> > cancelElection did not find election node to remove
> > /collections/oi/leader_elect/shard3/election/93434598784958483-178.21.116.225:8080_solr_oi_h-n_0000000043
> > 2015-03-08 05:25:49,220 ERROR [solr.update.UpdateLog] - [Thread-136] - :
> > Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001394 refcount=2}
> > java.nio.channels.ClosedChannelException
> >         at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
> >         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
> >         at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
> >         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
> >         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
> >         at java.io.InputStream.read(InputStream.java:101)
> >         at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
> >         at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
> >         at org.apache.solr.cloud.ZkController.register(ZkController.java:841)
> >         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277)
> >         at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> > 2015-03-08 05:25:49,225 ERROR [solr.update.UpdateLog] - [Thread-136] - :
> > Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001471 refcount=2}
> > java.nio.channels.ClosedChannelException
> >         at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
> >         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
> >         at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
> >         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
> >         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
> >         at java.io.InputStream.read(InputStream.java:101)
> >         at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
> >         at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
> >         at org.apache.solr.cloud.ZkController.register(ZkController.java:841)
> >         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277)
> >         at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> > 2015-03-08 12:21:04,438 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-28] - :
> > Stopping recovery for core=oi_h coreNodeName=178.21.116.225:8080_solr_oi_h
> >
> > The other node makes a mess in the logs:
> >
> > 2015-03-08 05:25:46,020 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-20] - :
> > Stopping recovery for core=oi_c coreNodeName=194.145.201.190:8080_solr_oi_c
> > 2015-03-08 05:26:08,670 ERROR [solr.cloud.ShardLeaderElectionContext] - [zkCallback-2-thread-19] - :
> > There was a problem trying to register as the leader:org.apache.solr.common.SolrException:
> > Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
> >         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:146)
> >         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:317)
> >         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
> >         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
> >         at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
> >         at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
> >         at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:210)
> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > Caused by: org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$NodeExistsException:
> > KeeperErrorCode = NodeExists for /collections/openindex/leaders/shard3
> >         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:40)
> >         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:134)
> >         ... 11 more
> > Caused by: org.apache.zookeeper.KeeperException$NodeExistsException:
> > KeeperErrorCode = NodeExists for /collections/oi/leaders/shard3
> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
> >         at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:463)
> >         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
> >         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:460)
> >         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:417)
> >         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:404)
> >         at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:138)
> >         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:34)
> >         ... 12 more
> >
> > 2015-03-08 05:26:08,682 WARN [solr.cloud.ElectionContext] - [zkCallback-2-thread-19] - :
> > cancelElection did not find election node to remove
> > /collections/openindex/leader_elect/shard3/election/93434598784958480-194.145.201.190:8080_solr_oi_c-n_0000000044
> > 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - :
> > Error while trying to recover. core=oi_c:org.apache.solr.common.SolrException:
> > No registered leader was found after waiting for 4000ms , collection: oi slice: shard3
> >         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
> >         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
> >         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
> >         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
> >
> > ... and then a continuous mess like below:
> >
> > 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - :
> > Recovery failed - trying again... (0) core=oi_c
> > 2015-03-08 05:26:26,703 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - :
> > Error while trying to recover. core=oi_c:org.apache.solr.common.SolrException:
> > No registered leader was found after waiting for 4000ms , collection: oi slice: shard3
> >         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
> >         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
> >         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
> >         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
> >
> > I haven't seen this before; the cluster was unavailable for indexing,
> > search still worked as expected. Is there an open issue I can't seem to
> > find now?
> >
> > Markus
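As a footnote to the race Mark describes above: the leader-registration half of it can be illustrated against a bare ZooKeeper ensemble. The following is a hypothetical sketch using the plain ZooKeeper client API, not Solr's ElectionContext code; the connect string, znode path, and payload are placeholders, and the parent znodes are assumed to already exist.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of the race described above: the would-be leader tries
// to create the shard's leader znode, but the previous leader's node is still
// there, so registration fails with NodeExistsException and nobody forces it
// away. Not Solr code; connect string, path, and payload are placeholders.
public class LeaderNodeRace {

    private static final String LEADER_PATH = "/collections/oi/leaders/shard3";

    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await(); // wait for the session before issuing requests

        try {
            // Claim the shard by creating an ephemeral leader node. If the old
            // leader's node is still present (its session has not expired or it
            // has not cleaned up), this fails just like the makePath call in
            // ShardLeaderElectionContextBase did in the logs above.
            zk.create(LEADER_PATH, "leader props".getBytes("UTF-8"),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("registered as leader for shard3");
        } catch (KeeperException.NodeExistsException e) {
            // The old leader won't give up the throne; forcing the node away
            // could hide bugs or lose data, so the shard simply has no leader
            // until the stale node disappears.
            System.out.println("leader znode already exists: " + e.getPath());
        } finally {
            zk.close();
        }
    }
}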