It might be that your overseer queue is overloaded, similar to what is
described here:
https://support.lucidworks.com/hc/en-us/articles/203959903-Bringing-up-downed-Solr-servers-that-don-t-want-to-come-up
If the overseer queue grows too long, you get hit by this:
https://github.com/Netflix/curator/wiki/Tech-Note-4
Try requesting the overseer status
(/solr/admin/collections?action=OVERSEERSTATUS); a sketch of such a
check follows below. If that fails, you have likely hit that problem.
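Any HTTP client will do for that check. Here is a minimal Java sketch;
the host, port, and timeouts are placeholders, not taken from your
setup:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Quick probe of the overseer: if OVERSEERSTATUS hangs or errors out,
// the overseer is likely wedged. Host and port are placeholders.
public class OverseerStatusProbe {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/admin/collections"
                + "?action=OVERSEERSTATUS&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);   // fail fast if the node is stuck
        conn.setReadTimeout(10000);
        System.out.println("HTTP " + conn.getResponseCode());
        // getInputStream() throws on HTTP errors, which here is also an answer.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}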
If so, you won't be able to use the ZooKeeper command-line client
anymore either. You can then restart all your ZK nodes with an
increased jute.maxbuffer value (it is a JVM system property, passed as
-Djute.maxbuffer=<bytes>). Once ZK is restarted, you can use the ZK
command-line client with the same jute.maxbuffer value and check how
many entries /overseer/queue has in ZK. Normally there should be only a
few entries, but if you see thousands, you should delete them. I used a
few lines of Java code for that, again setting jute.maxbuffer to the
same value.
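Roughly like this (a rough sketch, not my exact code; the connect
string and the queue-size threshold are placeholders, and the buffer
size must match what you gave the ZK servers):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// One-off cleanup of an overgrown overseer queue. Run it with the same
// raised buffer as the ZK servers, e.g.:
//   java -Djute.maxbuffer=10000000 OverseerQueueCleanup
public class OverseerQueueCleanup {
    public static void main(String[] args) throws Exception {
        // Connect string is a placeholder for your ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000,
                event -> { /* no-op watcher */ });
        try {
            List<String> entries = zk.getChildren("/overseer/queue", false);
            System.out.println("/overseer/queue has " + entries.size()
                    + " entries");
            if (entries.size() > 1000) { // only wipe a pathologically long queue
                for (String entry : entries) {
                    zk.delete("/overseer/queue/" + entry, -1); // -1 = any version
                }
            }
        } finally {
            zk.close();
        }
    }
}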
Once cleaned up, restart the Solr nodes one by one and keep an eye on
the overseer status.
On 02.02.2017 10:52, Ravi Solr wrote:
Following up on my previous email, the intermittent server
unavailability seems to be linked to the interaction between Solr and
ZooKeeper. Can somebody help me understand what this error means and
how to recover from it?
2017-02-02 09:44:24.648 ERROR (recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3) [c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue/qn-
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
    at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Thanks
Ravi Kiran Bhaskar
On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr <ravis...@gmail.com> wrote:
Hello,
Yesterday I upgraded from 6.0.1 to 6.4.0, and it's been a straight 12
hours of debugging! Can somebody kindly help me out of this misery?
I have a set of 8 single-shard collections, each with 3 replicas. As
soon as I updated the configs and started the servers, one of my
collections got stuck with no leader. I have restarted Solr to no
avail, and I also tried to force a leader via the Collections API; that
didn't work either. I also see that, from time to time, multiple Solr
nodes go down all at the same time, and only a restart resolves the
issue.
The error snippets are shown below:
2017-02-02 01:43:42.785 ERROR (recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1) [c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1] o.a.s.c.RecoveryStrategy Error while trying to recover. core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: clicktrack slice: shard1
solr.log.9:2017-02-02 01:43:41.336 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
solr.log.9:2017-02-02 01:43:42.224 INFO (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
solr.log.9:2017-02-02 01:43:43.767 INFO (zkCallback-4-thread-23-processing-n:10.128.159.245:9001_solr) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/clicktrack/state.json] for collection [clicktrack] has occurred - updating... (live nodes size: [1])
Suspecting the worst, I backed up the index, renamed the collection's
data folder, and restarted the servers; this time the collection got a
proper leader. So is my index really corrupted? The Solr UI showed live
nodes just like the logs, but without any leader. Even with the leader
issue somewhat alleviated after renaming the data folder and letting
Solr create a new data folder, my servers did go down a couple of
times.
I am not all that well versed with ZooKeeper... any trick to make
ZooKeeper pick a leader and be happy? Did anybody have Solr/ZooKeeper
issues with 6.4.0?
Thanks
Ravi Kiran Bhaskar