This has been an occasional problem with clusters that have lots of
replicas in aggregate. There was a major improvement in how large
Overseer queues are handled in SOLR-10619, which was released with Solr
6.6 and which you might want to look at.

If you can't go to 6.6 (or apply the patch yourself to your version),
you can start up your nodes more gradually. Essentially, the longer the
queue, the more time it takes to process each entry, so it starts
to spin out of control.

There are several other improvements, but that's the biggest I think.

Best,
Erick

On Tue, Aug 22, 2017 at 11:32 AM, Hendrik Haddorp
<hendrik.hadd...@gmx.net> wrote:
> It is a known problem:
> https://cwiki.apache.org/confluence/display/CURATOR/TN4
>
> There are multiple JIRAs around this, like the one I pointed to earlier:
> https://issues.apache.org/jira/browse/SOLR-10524
> There it states:
> This JIRA is to break out that part of the discussion as it might be an easy
> win whereas "eliminating the Overseer queue" would be quite an undertaking.
>
> I assume this issue only shows up if you have many cores. There are also
> some config settings that might have an effect, but I have not really
> figured out the magic settings. As said, Solr 6.6 might also work better.
>
>
> On 22.08.2017 19:18, Jeff Courtade wrote:
>>
>> righto,
>>
>> thanks very much for your help clarifying this. I am not alone :)
>>
>> I have been looking at this for a few days now.
>>
>> I am seeing people who have experienced this issue going back to solr
>> version 4.x.
>>
>> I am wondering if it is an underlying issue with the way the queue is
>> managed.
>>
>> I would think that it should not be possible for it to get into a state
>> that is not recoverable except destructively.
>>
>> If you have a very active Solr cluster, I am thinking this could cause
>> data loss.
>>
>>
>>
>>
>>
>>
>> --
>> Thanks,
>>
>> Jeff Courtade
>> M: 240.507.6116
>>
>> On Tue, Aug 22, 2017 at 1:14 PM, Hendrik Haddorp <hendrik.hadd...@gmx.net>
>> wrote:
>>
>>> - stop all solr nodes
>>> - start zk with the new jute.maxbuffer setting
>>> - start a zk client, like zkCli, with the changed jute.maxbuffer setting
>>> and check that you can read out the overseer queue (see the sketch below)
>>> - clear the queue
>>> - restart zk with the normal settings
>>> - slowly start solr
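>>>
>>> For illustration, the read-out check could be done with a small Java
>>> program roughly like the following. This is only a sketch; the connect
>>> string localhost:2181, the timeout, and the class name are assumptions,
>>> so adjust them to your setup. Run it with the ZooKeeper client jar on the
>>> classpath and with the same increased jute.maxbuffer value that you gave
>>> the ZK server, e.g. java -Djute.maxbuffer=8388608 ReadOverseerQueue
>>>
>>>   import java.util.List;
>>>   import java.util.concurrent.CountDownLatch;
>>>   import org.apache.zookeeper.WatchedEvent;
>>>   import org.apache.zookeeper.Watcher;
>>>   import org.apache.zookeeper.ZooKeeper;
>>>
>>>   public class ReadOverseerQueue {
>>>       public static void main(String[] args) throws Exception {
>>>           CountDownLatch connected = new CountDownLatch(1);
>>>           ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
>>>               public void process(WatchedEvent event) {
>>>                   if (event.getState() == Event.KeeperState.SyncConnected) {
>>>                       connected.countDown();
>>>                   }
>>>               }
>>>           });
>>>           connected.await();
>>>           // this is the call that fails when jute.maxbuffer is still too
>>>           // small on the client and/or server side
>>>           List<String> entries = zk.getChildren("/overseer/queue", false);
>>>           System.out.println("overseer queue length: " + entries.size());
>>>           zk.close();
>>>       }
>>>   }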
>>>
>>> On 22.08.2017 15:27, Jeff Courtade wrote:
>>>
>>>> I set jute.maxbuffer on the ZooKeeper hosts; should this be done for
>>>> Solr as well?
>>>>
>>>> Mine is happening in a severely memory constrained environment as well.
>>>>
>>>> Jeff Courtade
>>>> M: 240.507.6116
>>>>
>>>> On Aug 22, 2017 8:53 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net>
>>>> wrote:
>>>>
>>>>> We have Solr and ZK running in Docker containers. There is no more than
>>>>> one Solr/ZK node per host, but a Solr and a ZK node can run on the same
>>>>> host, so Solr and ZK are spread out separately.
>>>>>
>>>>> I have not seen this problem during normal processing, just when we
>>>>> recycle nodes or when we have nodes fail, which is pretty much always
>>>>> caused by being out of memory, which again is unfortunately a bit
>>>>> complex in Docker. When nodes come up they add quite a few tasks to the
>>>>> overseer queue, I assume one task for every core. We have about 2000
>>>>> cores on each node. If nodes come up too fast the queue might grow to a
>>>>> few thousand entries. At maybe 10000 entries it usually reaches the
>>>>> point of no return and Solr is just adding more tasks than it is able
>>>>> to process. So it's best to pull the plug at that point so that you
>>>>> will not have to play with jute.maxbuffer to get Solr up again.
>>>>>
>>>>> We are using Solr 6.3. There are some improvements in 6.6:
>>>>>       https://issues.apache.org/jira/browse/SOLR-10524
>>>>>       https://issues.apache.org/jira/browse/SOLR-10619
>>>>>
>>>>> On 22.08.2017 14:41, Jeff Courtade wrote:
>>>>>
>>>>>> Thanks very much.
>>>>>>
>>>>>> I will follow up when we try this.
>>>>>>
>>>>>> I'm curious about the env this is happening in for you... are the
>>>>>> ZooKeeper servers residing on Solr nodes? Are the Solr nodes
>>>>>> underpowered in RAM and/or CPU?
>>>>>>
>>>>>> Jeff Courtade
>>>>>> M: 240.507.6116
>>>>>>
>>>>>> On Aug 22, 2017 8:30 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm always using a small Java program to delete the nodes directly. I
>>>>>>> assume you can also delete the whole node but that is nothing I have
>>>>>>> tried myself.
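>>>>>>>
>>>>>>> Roughly, such a program could look like this (only a sketch; it
>>>>>>> assumes ZK on localhost:2181, Solr stopped, and the JVM started with
>>>>>>> an increased -Djute.maxbuffer so the children of /overseer/queue can
>>>>>>> be listed at all):
>>>>>>>
>>>>>>>   import java.util.List;
>>>>>>>   import org.apache.zookeeper.ZooKeeper;
>>>>>>>
>>>>>>>   public class ClearOverseerQueue {
>>>>>>>       public static void main(String[] args) throws Exception {
>>>>>>>           // no-op watcher; for a throwaway tool a short sleep is the
>>>>>>>           // simplest way to let the session get established
>>>>>>>           ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, e -> {});
>>>>>>>           Thread.sleep(2000);
>>>>>>>           List<String> entries = zk.getChildren("/overseer/queue", false);
>>>>>>>           for (String entry : entries) {
>>>>>>>               // version -1 means delete regardless of the node version
>>>>>>>               zk.delete("/overseer/queue/" + entry, -1);
>>>>>>>           }
>>>>>>>           System.out.println("deleted " + entries.size() + " entries");
>>>>>>>           zk.close();
>>>>>>>       }
>>>>>>>   }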
>>>>>>>
>>>>>>> On 22.08.2017 14:27, Jeff Courtade wrote:
>>>>>>>
>>>>>>>> So ...
>>>>>>>>
>>>>>>>> Using zkCli.sh, I have jute.maxbuffer set up so I can list it now.
>>>>>>>>
>>>>>>>> Can I
>>>>>>>>
>>>>>>>>      rmr /overseer/queue
>>>>>>>>
>>>>>>>> Or do I need to delete individual entries?
>>>>>>>>
>>>>>>>> Will
>>>>>>>>
>>>>>>>> rmr /overseer/queue/*
>>>>>>>>
>>>>>>>> work?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Jeff Courtade
>>>>>>>> M: 240.507.6116
>>>>>>>>
>>>>>>>> On Aug 22, 2017 8:20 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> When Solr is stopped it did not cause a problem so far.
>>>>>>>>>
>>>>>>>>> I cleared the queue also a few times while Solr was still running.
>>>>>>>>> That also didn't result in a real problem, but some replicas might
>>>>>>>>> not come up again. In those cases it helps to either restart the
>>>>>>>>> node with the replicas that are in state "down" or to remove the
>>>>>>>>> failed replica and then recreate it. But as said, clearing it when
>>>>>>>>> Solr is stopped has worked fine so far.
>>>>>>>>>
>>>>>>>>> On 22.08.2017 14:03, Jeff Courtade wrote:
>>>>>>>>>
>>>>>>>>>> How does the cluster react to the overseer queue entries disappearing?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Jeff Courtade
>>>>>>>>>> M: 240.507.6116
>>>>>>>>>>
>>>>>>>>>> On Aug 22, 2017 8:01 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Jeff,
>>>>>>>>>>>
>>>>>>>>>>> we ran into that a few times already. We have lots of collections
>>>>>>>>>>> and when nodes get started too fast the overseer queue grows
>>>>>>>>>>> faster than Solr can process it. At some point Solr tries to redo
>>>>>>>>>>> things like leader votes and adds new tasks to the list, which
>>>>>>>>>>> then gets longer and longer. Once it is too long you can not read
>>>>>>>>>>> out the data anymore but Solr is still adding tasks. In case you
>>>>>>>>>>> already reached that point you have to start ZooKeeper and the
>>>>>>>>>>> ZooKeeper client with an increased "jute.maxbuffer" value. I
>>>>>>>>>>> usually double it until I can read out the queue again. After
>>>>>>>>>>> that I delete all entries in the queue and then start the Solr
>>>>>>>>>>> nodes one by one, like every 5 minutes.
>>>>>>>>>>>
>>>>>>>>>>> regards,
>>>>>>>>>>> Hendrik
>>>>>>>>>>>
>>>>>>>>>>> On 22.08.2017 13:42, Jeff Courtade wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have an issue with what seems to be a blocked up /overseer/queue.
>>>>>>>>>>>>
>>>>>>>>>>>> There are 700k+ entries.
>>>>>>>>>>>>
>>>>>>>>>>>> Solr cloud 6.x
>>>>>>>>>>>>
>>>>>>>>>>>> You cannot addreplica or deletereplica; the commands time out.
>>>>>>>>>>>>
>>>>>>>>>>>> Full stop and start of Solr and ZooKeeper does not clear it.
>>>>>>>>>>>>
>>>>>>>>>>>> Is it safe to use the ZooKeeper supplied zkCli.sh to simply rmr
>>>>>>>>>>>> the /overseer/queue ?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Jeff Courtade
>>>>>>>>>>>> M: 240.507.6116
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>
