I set jute.maxbuffer on the ZK hosts. Should this be done for Solr as well? Mine is happening in a severely memory-constrained environment as well.
Jeff Courtade
M: 240.507.6116

On Aug 22, 2017 8:53 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net> wrote:

> We have Solr and ZK running in Docker containers. There is no more than
> one Solr/ZK node per host, but a Solr node and a ZK node can run on the
> same host. So Solr and ZK are spread out separately.
>
> I have not seen this problem during normal processing, just when we recycle
> nodes or when we have nodes fail, which is pretty much always caused by
> being out of memory, which again is unfortunately a bit complex in Docker.
> When nodes come up they add quite a few tasks to the overseer queue. I
> assume one task for every core. We have about 2000 cores on each node. If
> nodes come up too fast the queue might grow to a few thousand entries. At
> maybe 10000 entries it usually reaches the point of no return and Solr is
> just adding more tasks than it is able to process. So it's best to pull the
> plug at that point, as you will then not have to play with jute.maxbuffer
> to get Solr up again.
>
> We are using Solr 6.3. There are some improvements in 6.6:
> https://issues.apache.org/jira/browse/SOLR-10524
> https://issues.apache.org/jira/browse/SOLR-10619
>
> On 22.08.2017 14:41, Jeff Courtade wrote:
>
>> Thanks very much.
>>
>> I will follow up when we try this.
>>
>> I'm curious about the env this is happening to you in: are the ZooKeeper
>> servers residing on Solr nodes? Are the Solr nodes underpowered in RAM
>> and/or CPU?
>>
>> Jeff Courtade
>> M: 240.507.6116
>>
>> On Aug 22, 2017 8:30 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net> wrote:
>>
>>> I'm always using a small Java program to delete the nodes directly. I
>>> assume you can also delete the whole node, but that is nothing I have
>>> tried myself.
>>>
>>> On 22.08.2017 14:27, Jeff Courtade wrote:
>>>
>>>> So ...
>>>>
>>>> Using zkCli.sh, I have jute.maxbuffer set up so I can list it now.
>>>>
>>>> Can I
>>>>
>>>> rmr /overseer/queue
>>>>
>>>> Or do I need to delete individual entries?
>>>>
>>>> Will
>>>>
>>>> rmr /overseer/queue/*
>>>>
>>>> work?
>>>>
>>>> Jeff Courtade
>>>> M: 240.507.6116
>>>>
>>>> On Aug 22, 2017 8:20 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net> wrote:
>>>>
>>>>> When Solr is stopped it did not cause a problem so far.
>>>>> I cleared the queue also a few times while Solr was still running. That
>>>>> also didn't result in a real problem, but some replicas might not come
>>>>> up again. In those cases it helps to either restart the node with the
>>>>> replicas that are in state "down" or to remove the failed replica and
>>>>> then recreate it. But as said, clearing it when Solr is stopped worked
>>>>> fine so far.
>>>>>
>>>>> On 22.08.2017 14:03, Jeff Courtade wrote:
>>>>>
>>>>>> How does the cluster react to the overseer queue entries disappearing?
>>>>>>
>>>>>> Jeff Courtade
>>>>>> M: 240.507.6116
>>>>>>
>>>>>> On Aug 22, 2017 8:01 AM, "Hendrik Haddorp" <hendrik.hadd...@gmx.net> wrote:
>>>>>>
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> we ran into that a few times already. We have lots of collections, and
>>>>>>> when nodes get started too fast the overseer queue grows faster than
>>>>>>> Solr can process it. At some point Solr tries to redo things like
>>>>>>> leader votes and adds new tasks to the list, which then gets longer
>>>>>>> and longer. Once it is too long you cannot read out the data anymore,
>>>>>>> but Solr is still adding tasks. In case you already reached that point
>>>>>>> you have to start ZooKeeper and the ZooKeeper client with an increased
>>>>>>> "jute.maxbuffer" value. I usually double it until I can read out the
>>>>>>> queue again. After that I delete all entries in the queue and then
>>>>>>> start the Solr nodes one by one, like every 5 minutes.
>>>>>>>
>>>>>>> regards,
>>>>>>> Hendrik
>>>>>>>
>>>>>>> On 22.08.2017 13:42, Jeff Courtade wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have an issue with what seems to be a blocked-up /overseer/queue.
>>>>>>>>
>>>>>>>> There are 700k+ entries.
>>>>>>>>
>>>>>>>> Solr Cloud 6.x
>>>>>>>>
>>>>>>>> You cannot addreplica or deletereplica; the commands time out.
>>>>>>>>
>>>>>>>> A full stop and start of Solr and ZooKeeper does not clear it.
>>>>>>>>
>>>>>>>> Is it safe to use the ZooKeeper-supplied zkCli.sh to simply rmr the
>>>>>>>> /overseer/queue?
>>>>>>>>
>>>>>>>> Jeff Courtade
>>>>>>>> M: 240.507.6116