Thanks Hendrik. I am baffled as to why I did not hit this issue prior to moving to 6.4.0.
On Thu, Feb 2, 2017 at 7:58 AM, Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:

> Might be that your overseer queue got overloaded. Similar to what is
> described here:
> https://support.lucidworks.com/hc/en-us/articles/203959903-Bringing-up-downed-Solr-servers-that-don-t-want-to-come-up
>
> If the overseer queue gets too long you get hit by this:
> https://github.com/Netflix/curator/wiki/Tech-Note-4
>
> Try to request the overseer status
> (/solr/admin/collections?action=OVERSEERSTATUS).
> If that fails you likely hit that problem. If so, you also can't use the
> ZooKeeper command line client anymore. You can now restart all your ZK
> nodes with an increased jute.maxbuffer value. Once ZK is restarted you can
> use the ZK command line client with the same jute.maxbuffer value and check
> how many entries /overseer/queue has in ZK. Normally there should be a few
> entries, but if you see thousands then you should delete them. I used a few
> lines of Java code for that, again setting jute.maxbuffer to the same
> value. Once cleaned up, restart the Solr nodes one by one and keep an eye
> on the overseer status.
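
For anyone who hits the same thing later: a minimal sketch of the kind of
queue cleanup Hendrik describes could look like the code below. This is only
my guess at the approach, not his actual code; the ZooKeeper connect string
and the class name are placeholders, and the client JVM has to be started
with the same raised buffer as the ZK servers, e.g.
"java -Djute.maxbuffer=10000000 OverseerQueueCleanup".

// Sketch only: lists and, if flooded, deletes the children of /overseer/queue.
// The connect string is a placeholder for your ZooKeeper ensemble.
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class OverseerQueueCleanup {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // release the latch once the session is established
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        List<String> entries = zk.getChildren("/overseer/queue", false);
        System.out.println("/overseer/queue has " + entries.size() + " entries");

        // Only clean up when the queue is clearly flooded (thousands of entries);
        // a handful of entries is normal, as Hendrik notes.
        if (entries.size() > 1000) {
            for (String child : entries) {
                zk.delete("/overseer/queue/" + child, -1); // -1 matches any version
            }
            System.out.println("deleted " + entries.size() + " entries");
        }
        zk.close();
    }
}

Note that only the queue children are removed, not the /overseer/queue node
itself, and the 1000-entry threshold is arbitrary — the point is just to
distinguish "a few entries" from "thousands".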
>
> On 02.02.2017 10:52, Ravi Solr wrote:
>
>> Following up on my previous email, the intermittent server unavailability
>> seems to be linked to the interaction between Solr and ZooKeeper. Can
>> somebody help me understand what this error means and how to recover from
>> it.
>>
>> 2017-02-02 09:44:24.648 ERROR
>> (recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr
>> x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3)
>> [c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4]
>> o.a.s.c.RecoveryStrategy Error while trying to recover.
>> core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /overseer/queue/qn-
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>>         at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>>         at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>>         at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>>         at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
>>         at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
>>         at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
>>         at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
>>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
>>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>>         at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>> On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr <ravis...@gmail.com> wrote:
>>
>>> Hello,
>>>          Yesterday I upgraded from 6.0.1 to 6.4.0, and it's been 12 straight
>>> hours of a debugging spree!! Can somebody kindly help me out of this misery?
>>>
>>> I have a set of 8 single-shard collections with 3 replicas each. As soon as I
>>> updated the configs and started the servers, one of my collections got stuck
>>> with no leader. I have restarted Solr to no avail, and I also tried to force a
>>> leader via the Collections API, but that didn't work either. I also see that,
>>> from time to time, multiple Solr nodes go down all at the same time, and only
>>> a restart resolves the issue.
>>>
>>> The error snippets are shown below:
>>>
>>> 2017-02-02 01:43:42.785 ERROR
>>> (recoveryExecutor-3-thread-6-processing-n:10.128.159.245:9001_solr
>>> x:clicktrack_shard1_replica1 s:shard1 c:clicktrack r:core_node1)
>>> [c:clicktrack s:shard1 r:core_node1 x:clicktrack_shard1_replica1]
>>> o.a.s.c.RecoveryStrategy Error while trying to recover.
>>> core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException:
>>> No registered leader was found after waiting for 4000ms , collection:
>>> clicktrack slice: shard1
>>>
>>> solr.log.9:2017-02-02 01:43:41.336 INFO
>>> (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ]
>>> o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
>>> state:SyncConnected type:NodeDataChanged
>>> path:/collections/clicktrack/state.json] for collection [clicktrack]
>>> has occurred - updating... (live nodes size: [1])
>>> solr.log.9:2017-02-02 01:43:42.224 INFO
>>> (zkCallback-4-thread-29-processing-n:10.128.159.245:9001_solr) [ ]
>>> o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
>>> state:SyncConnected type:NodeDataChanged
>>> path:/collections/clicktrack/state.json] for collection [clicktrack]
>>> has occurred - updating... (live nodes size: [1])
>>> solr.log.9:2017-02-02 01:43:43.767 INFO
>>> (zkCallback-4-thread-23-processing-n:10.128.159.245:9001_solr) [ ]
>>> o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent
>>> state:SyncConnected type:NodeDataChanged
>>> path:/collections/clicktrack/state.json] for collection [clicktrack]
>>> has occurred - updating... (live nodes size: [1])
>>>
>>> Suspecting the worst, I backed up the index, renamed the collection's
>>> data folder, and restarted the servers; this time the collection got a
>>> proper leader. So is my index really corrupted? The Solr UI showed live
>>> nodes just like the logs, but without any leader. Even with the leader
>>> issue somewhat alleviated after renaming the data folder and letting Solr
>>> create a new data folder, my servers did go down a couple of times.
>>>
>>> I am not all that well versed with ZooKeeper... any trick to make
>>> ZooKeeper pick a leader and be happy?
>>> Did anybody have Solr/ZooKeeper issues with 6.4.0?
>>>
>>> Thanks
>>>
>>> Ravi Kiran Bhaskar
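
For completeness, the overseer status check that Hendrik suggests keeping an
eye on during the node-by-node restart can also be scripted. This is just a
plain HTTP GET against the same Collections API URL he quotes above; the host,
port, and class name are placeholders, and if the call hangs or errors out,
that is the symptom he describes.

// Sketch only: fetch /solr/admin/collections?action=OVERSEERSTATUS and dump the response.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class OverseerStatusCheck {
    public static void main(String[] args) throws Exception {
        // placeholder host/port for one of the Solr nodes
        URL url = new URL("http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        System.out.println("HTTP " + conn.getResponseCode());
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the JSON status; watch the queue sizes here
            }
        }
    }
}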