We have Solr and ZK running in Docker containers. There is no more than one Solr node and one ZK node per host, but a Solr node and a ZK node can share a host. So Solr and ZK are each spread out separately.

I have not seen this problem during normal processing, only when we recycle nodes or when nodes fail, which for us is pretty much always caused by running out of memory (unfortunately a bit complex to handle in Docker). When nodes come up they add quite a few tasks to the overseer queue, I assume one task for every core, and we have about 2000 cores on each node. If nodes come up too fast the queue can grow to a few thousand entries. At maybe 10000 entries it usually reaches the point of no return: Solr is adding more tasks than it is able to process. It's best to pull the plug at that point, so you will not have to play with jute.maxbuffer to get Solr up again.

We are using Solr 6.3. There are some improvements in 6.6:
    https://issues.apache.org/jira/browse/SOLR-10524
    https://issues.apache.org/jira/browse/SOLR-10619

On 22.08.2017 14:41, Jeff Courtade wrote:
Thanks very much.

I will followup when we try this.

I'm curious about the environment where this is happening to you: are the
ZooKeeper servers residing on Solr nodes? Are the Solr nodes underpowered
in RAM and/or CPU?

Jeff Courtade
M: 240.507.6116

On Aug 22, 2017 8:30 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net> wrote:

I'm always using a small Java program to delete the nodes directly. I
assume you can also delete the whole node, but that is not something I
have tried myself.
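
A minimal sketch of what such a program can look like, using the plain
ZooKeeper Java client (the connect string, timeout, class name, and buffer
size are placeholders for your environment, not our exact program):

    // Run with e.g. -Djute.maxbuffer=8388608 so the client can read the huge queue node.
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooKeeper;

    public class OverseerQueueCleaner {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Wait until the session is established before issuing requests.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
                if (event.getState() == KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            try {
                // Delete every queue entry individually; version -1 skips the version check.
                for (String child : zk.getChildren("/overseer/queue", false)) {
                    zk.delete("/overseer/queue/" + child, -1);
                }
            } finally {
                zk.close();
            }
        }
    }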

On 22.08.2017 14:27, Jeff Courtade wrote:

So ...

Using zkCli.sh, I have jute.maxbuffer set up so I can list the queue now.

Can I

   rmr /overseer/queue

Or do I need to delete individual entries?

Will

rmr /overseer/queue/*

work?

Jeff Courtade
M: 240.507.6116

On Aug 22, 2017 8:20 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net>
wrote:

When Solr is stopped, clearing the queue has not caused a problem so far.
I have also cleared the queue a few times while Solr was still running.
That didn't result in a real problem either, but some replicas might not
come up again. In those cases it helps to either restart the node with the
replicas that are in state "down" or to remove the failed replica and then
recreate it. But as said, clearing the queue while Solr is stopped has
worked fine so far.
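
For the remove-and-recreate path, the usual Collections API calls look
roughly like this (collection, shard, and replica names are placeholders):

    http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=coll1&shard=shard1&replica=core_node5
    http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=coll1&shard=shard1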

On 22.08.2017 14:03, Jeff Courtade wrote:

How does the cluster react to the overseer queue entries disappearing?


Jeff Courtade
M: 240.507.6116

On Aug 22, 2017 8:01 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net> wrote:

Hi Jeff,

we ran into that a few times already. We have lots of collections, and
when nodes get started too fast the overseer queue grows faster than Solr
can process it. At some point Solr tries to redo things like leader votes
and adds new tasks to the list, which then gets longer and longer. Once it
is too long you cannot read out the data anymore, but Solr is still adding
tasks. In case you have already reached that point, you have to start
ZooKeeper and the ZooKeeper client with an increased "jute.maxbuffer"
value. I usually double it until I can read out the queue again. After
that I delete all entries in the queue and then start the Solr nodes one
by one, like every 5 minutes.
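
For reference, this is roughly how the increased value can be passed; the
8 MB below is just an example, and note that the Solr JVM is a ZooKeeper
client as well, so it needs the same flag (e.g. via SOLR_OPTS):

    # ZooKeeper server (zkServer.sh passes $JVMFLAGS to the JVM)
    JVMFLAGS="-Djute.maxbuffer=8388608" bin/zkServer.sh restart

    # zkCli.sh (reads $CLIENT_JVMFLAGS)
    CLIENT_JVMFLAGS="-Djute.maxbuffer=8388608" bin/zkCli.sh -server localhost:2181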

regards,
Hendrik

On 22.08.2017 13:42, Jeff Courtade wrote:

Hi,

I have an issue with what seems to be a blocked-up /overseer/queue.

There are 700k+ entries.

SolrCloud 6.x.

You cannot ADDREPLICA or DELETEREPLICA; the commands time out.

A full stop and start of Solr and ZooKeeper does not clear it.

Is it safe to use the ZooKeeper-supplied zkCli.sh to simply rmr the
/overseer/queue?


Jeff Courtade
M: 240.507.6116