Thanks Hendrik. I am baffled as to why I did not hit this issue prior to moving to 6.4.0.
On Thu, Feb 2, 2017 at 7:58 AM, Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:

> Might be that your overseer queue got overloaded. Similar to what is
> described here:
> https://support.lucidworks.com/hc/en-us/articles/203959903-Bringing-up-downed-Solr-servers-that-don-t-want-to-come-up
>
> If the overseer queue gets too long you get hit by this:
> https://github.com/Netflix/curator/wiki/Tech-Note-4
>
> Try to request the overseer status
> (/solr/admin/collections?action=OVERSEERSTATUS).
> If that fails you likely hit that problem. If so, you also can't use the
> ZooKeeper command line client anymore. You can now restart all your ZK
> nodes with an increased jute.maxbuffer value. Once ZK is restarted you can
> use the ZK command line client with the same jute.maxbuffer value and check
> how many entries /overseer/queue has in ZK. Normally there should be a few
> entries, but if you see thousands then you should delete them. I used a few
> lines of Java code for that, again setting jute.maxbuffer to the same
> value. Once cleaned up, restart the Solr nodes one by one and keep an eye
> on the overseer status.
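
For anyone who hits the same thing later: a minimal sketch of the kind of
queue cleanup Hendrik describes could look like the code below. This is only
my guess at the approach, not his actual code; the ZooKeeper connect string
and the class name are placeholders, and the client JVM has to be started
with the same raised buffer as the ZK servers, e.g.
"java -Djute.maxbuffer=10000000 OverseerQueueCleanup".

// Sketch only: lists and, if flooded, deletes the children of /overseer/queue.
// The connect string is a placeholder for your ZooKeeper ensemble.
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class OverseerQueueCleanup {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // release the latch once the session is established
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        List<String> entries = zk.getChildren("/overseer/queue", false);
        System.out.println("/overseer/queue has " + entries.size() + " entries");

        // Only clean up when the queue is clearly flooded (thousands of entries);
        // a handful of entries is normal, as Hendrik notes.
        if (entries.size() > 1000) {
            for (String child : entries) {
                zk.delete("/overseer/queue/" + child, -1); // -1 matches any version
            }
            System.out.println("deleted " + entries.size() + " entries");
        }
        zk.close();
    }
}

Note that only the queue children are removed, not the /overseer/queue node
itself, and the 1000-entry threshold is arbitrary — the point is just to
distinguish "a few entries" from "thousands".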
>
> On 02.02.2017 10:52, Ravi Solr wrote:
>
>> Following up on my previous email, the intermittent server unavailability
>> seems to be linked to the interaction between Solr and ZooKeeper. Can
>> somebody help me understand what this error means and how to recover from
>> it.
>>
>> 2017-02-02 09:44:24.648 ERROR
>> (recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr
>> x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3)
>> [c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4]
>> o.a.s.c.RecoveryStrategy Error while trying to recover.
>> core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /overseer/queue/qn-
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>>         at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>>         at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>>         at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>>         at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
>>         at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
>>         at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
>>         at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
>>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
>>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>>         at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr <ravis...@gmail.com> wrote:
>>
>>> Hello,
>>>          Yesterday I upgraded from 6.0.1 to 6.4.0, and it's been 12 straight
>>> hours of a debugging spree!! Can somebody kindly help me out of this misery?
>>>
>>> I have a set of 8 single-shard collections with 3 replicas each. As soon as I
>>> updated the configs and started the servers, one of my collections got stuck
>>> with no leader. I have restarted Solr to no avail, and I also tried to force a
>>> leader via the Collections API, but that didn't work either. I also see that,
>>> from time to time, multiple Solr nodes go down all at the same time, and only
>>> a restart resolves the issue.
>>>
>>> The error snippets are shown below:
>>>
>>> 2017-02-02 01:43:42.785 ERROR
>>> (recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr
>>> x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1)
>>> [c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1]
>>> o.a.s.c.RecoveryStrategy Error while trying to recover.
>>> core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException:
>>> No registered leader was found after waiting for 4000ms , collection:
>>> clicktrack slice: shard1
>>>
>>> solr.log.9:2017-02-02 01:43:41.336 INFO
>>> (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ]
>>> o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
>>> state:SyncConnected type:NodeDataChanged
>>> path:/collections/clicktrack/state.json] for collection [clicktrack]
>>> has occurred - updating... (live nodes size: [1])
>>> solr.log.9:2017-02-02 01:43:42.224 INFO
>>> (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ]
>>> o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
>>> state:SyncConnected type:NodeDataChanged
>>> path:/collections/clicktrack/state.json] for collection [clicktrack]
>>> has occurred - updating... (live nodes size: [1])
>>> solr.log.9:2017-02-02 01:43:43.767 INFO
>>> (zkCallback-4-thread-23-processing-n:10.128.159.245:9001_solr) [ ]
>>> o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
>>> state:SyncConnected type:NodeDataChanged
>>> path:/collections/clicktrack/state.json] for collection [clicktrack]
>>> has occurred - updating... (live nodes size: [1])
>>>
>>> Suspecting the worst, I backed up the index, renamed the collection's
>>> data folder, and restarted the servers; this time the collection got a
>>> proper leader. So is my index really corrupted? The Solr UI showed live
>>> nodes just like the logs, but without any leader. Even with the leader
>>> issue somewhat alleviated after renaming the data folder and letting Solr
>>> create a new data folder, my servers did go down a couple of times.
>>>
>>> I am not all that well versed with ZooKeeper... any trick to make
>>> ZooKeeper pick a leader and be happy?
>>> Did anybody have Solr/ZooKeeper issues with 6.4.0?
>>>
>>> Thanks
>>>
>>> Ravi Kiran Bhaskar
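
For completeness, the overseer status check that Hendrik suggests keeping an
eye on during the node-by-node restart can also be scripted. This is just a
plain HTTP GET against the same Collections API URL he quotes above; the host,
port, and class name are placeholders, and if the call hangs or errors out,
that is the symptom he describes.

// Sketch only: fetch /solr/admin/collections?action=OVERSEERSTATUS and dump the response.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class OverseerStatusCheck {
    public static void main(String[] args) throws Exception {
        // placeholder host/port for one of the Solr nodes
        URL url = new URL("http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        System.out.println("HTTP " + conn.getResponseCode());
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the JSON status; watch the queue sizes here
            }
        }
    }
}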