It might be that your overseer queue is overloaded, similar to what is
described here:
https://support.lucidworks.com/hc/en-us/articles/203959903-Bringing-up-downed-Solr-servers-that-don-t-want-to-come-up
If the overseer queue grows too long, you get hit by this:
https://github.com/Netflix/curator/wiki/Tech-Note-4
Try requesting the overseer status
(/solr/admin/collections?action=OVERSEERSTATUS); a sketch of such a
check follows below. If that fails, you have likely hit that problem.
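Any HTTP client will do for that check. Here is a minimal Java sketch;
the host, port, and timeouts are placeholders, not taken from your
setup:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Quick probe of the overseer: if OVERSEERSTATUS hangs or errors out,
// the overseer is likely wedged. Host and port are placeholders.
public class OverseerStatusProbe {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/admin/collections"
                + "?action=OVERSEERSTATUS&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);   // fail fast if the node is stuck
        conn.setReadTimeout(10000);
        System.out.println("HTTP " + conn.getResponseCode());
        // getInputStream() throws on HTTP errors, which here is also an answer.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}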
If so, you won't be able to use the ZooKeeper command-line client
anymore either. You can then restart all your ZK nodes with an
increased jute.maxbuffer value (it is a JVM system property, passed as
-Djute.maxbuffer=<bytes>). Once ZK is restarted, you can use the ZK
command-line client with the same jute.maxbuffer value and check how
many entries /overseer/queue has in ZK. Normally there should be only a
few entries, but if you see thousands, you should delete them. I used a
few lines of Java code for that, again setting jute.maxbuffer to the
same value.
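Roughly like this (a rough sketch, not my exact code; the connect
string and the queue-size threshold are placeholders, and the buffer
size must match what you gave the ZK servers):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// One-off cleanup of an overgrown overseer queue. Run it with the same
// raised buffer as the ZK servers, e.g.:
//   java -Djute.maxbuffer=10000000 OverseerQueueCleanup
public class OverseerQueueCleanup {
    public static void main(String[] args) throws Exception {
        // Connect string is a placeholder for your ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000,
                event -> { /* no-op watcher */ });
        try {
            List<String> entries = zk.getChildren("/overseer/queue", false);
            System.out.println("/overseer/queue has " + entries.size()
                    + " entries");
            if (entries.size() > 1000) { // only wipe a pathologically long queue
                for (String entry : entries) {
                    zk.delete("/overseer/queue/" + entry, -1); // -1 = any version
                }
            }
        } finally {
            zk.close();
        }
    }
}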
Once cleaned up, restart the Solr nodes one by one and keep an eye on
the overseer status.
On 02.02.2017 10:52, Ravi Solr wrote:
Following up on my previous email, the intermittent server
unavailability seems to be linked to the interaction between Solr and
ZooKeeper. Can somebody help me understand what this error means and
how to recover from it?
2017-02-02 09:44:24.648 ERROR (recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3) [c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue/qn-
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
    at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Thanks
Ravi Kiran Bhaskar
On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr <ravis...@gmail.com> wrote:
Hello,
Yesterday I upgraded from 6.0.1 to 6.4.0, and it's been a straight 12
hours of debugging! Can somebody kindly help me out of this misery?
I have a set of 8 single-shard collections, each with 3 replicas. As
soon as I updated the configs and started the servers, one of my
collections got stuck with no leader. I have restarted Solr to no
avail, and I also tried to force a leader via the Collections API; that
didn't work either. I also see that, from time to time, multiple Solr
nodes go down all at the same time, and only a restart resolves the
issue.
The error snippets are shown below:
2017-02-02 01:43:42.785 ERROR (recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1) [c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: clicktrack slice: shard1
solr.log.9:2017-02-02 01:43:41.336 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
solr.log.9:2017-02-02 01:43:42.224 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
solr.log.9:2017-02-02 01:43:43.767 INFO (zkCallback-4-thread-23-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
Suspecting the worst, I backed up the index, renamed the collection's
data folder, and restarted the servers; this time the collection got a
proper leader. So is my index really corrupted? The Solr UI showed live
nodes just like the logs, but without any leader. Even with the leader
issue somewhat alleviated after renaming the data folder and letting
Solr create a new data folder, my servers did go down a couple of
times.
I am not all that well versed with ZooKeeper... any trick to make
ZooKeeper pick a leader and be happy? Did anybody have Solr/ZooKeeper
issues with 6.4.0?
Thanks
Ravi Kiran Bhaskar