Later versions of Solr have been changed in two ways: 1> changes have been made to not put so many items in the overseer queue in the first place, and 2> changes have been made to process the messages that do get there much more quickly.
Meanwhile, my guess is you have a lot of replicas out there. I've seen this
happen when there are lots of collections and/or replicas and people try to
start many of them up at once. One strategy to get by is to start your Solr
nodes a few at a time, wait for the Overseer queue to get processed, then
start a few more. Unsatisfactory, but if the precursor to this was starting
all your Solr instances at once and you have a lot of replicas, it may help
until you can upgrade.

Best,
Erick

On Wed, Oct 25, 2017 at 5:44 PM, Tarjono, C. A. <c.a.tarj...@accenture.com> wrote:
> @Shawn Heisey,
>
> Thanks so much for your input! We will try your suggestion and hope it will
> resolve the issue.
>
> On a side note, would you know if this is an existing bug? If yes, has it
> been resolved in a later version? i.e. ZK allows adding nodes when it
> exceeds the buffer.
>
> We are currently using ZK 3.4.6 with SolrCloud 5.1.0.
>
> Thanks again!
>
> Best Regards,
>
> Christopher Tarjono
> Accenture Pte Ltd
>
> +65 9347 2484
> c.a.tarj...@accenture.com
> ________________________________
> From: Shawn Heisey <apa...@elyograg.org>
> Sent: 25 October 2017 20:57:30
> To: solr-user@lucene.apache.org
> Subject: [External] Re: SolrCloud not able to view cloud page - Loading of
> "/solr/zookeeper?wt=json" failed (HTTP-Status 500)
>
> On 10/24/2017 8:11 AM, Tarjono, C. A. wrote:
>> Would like to check if anyone has seen this issue before; we started
>> having this a few days ago:
>>
>> The only error I can see in the Solr console is below:
>>
>> 5960847[main-SendThread(172.16.130.132:2281)] WARN
>> org.apache.zookeeper.ClientCnxn [ ] – Session 0x65f4e28b7370001 for
>> server 172.16.130.132/172.16.130.132:2281, unexpected error, closing
>> socket connection and attempting reconnect java.io.IOException: Packet
>> len30829010 is out of range!
>>
>
> Combining the last part of what I quoted above with the image you shared
> later, I am pretty sure I know what is happening.
>
> The overseer queue in ZooKeeper (at the ZK path of /overseer/queue) has
> a lot of entries in it. Based on the fact that you are seeing a packet
> length beyond 30 million bytes, I am betting that the number of entries
> in the queue is between 1.5 million and 2 million. ZK cannot handle
> that packet size without a special startup argument. The value of the
> special parameter defaults to a little over one million bytes.
>
> To fix this, you're going to need to wipe out the overseer queue. ZK
> includes a script named ZkCli. Note that Solr includes a script called
> zkcli as well, which does very different things. You need the one
> included with ZooKeeper.
>
> Wiping out the queue when it is that large is not straightforward. You
> need to start the ZkCli script included with ZooKeeper with a
> -Djute.maxbuffer=31000000 argument and the same zkHost value used by
> Solr, and then use a command like "rmr /overseer/queue" in that command
> shell to completely remove the /overseer/queue path. Then you can
> restart the ZK servers without the jute.maxbuffer setting. You may need
> to restart Solr. Running this procedure might also require temporarily
> restarting the ZK servers with the same jute.maxbuffer argument, but I
> am not sure whether that is required.
>
> The basic underlying problem here is that ZK allows adding new nodes
> even when the size of the parent node exceeds the default buffer size.
> That issue is documented here:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-1162
>
> I can't be sure why your cloud is adding so many entries to the
> overseer queue. I have seen this problem happen when restarting a
> server in the cloud, particularly when there are a large number of
> collections or shard replicas in the cloud. Restarting multiple servers
> or restarting the same server multiple times without waiting for the
> overseer queue to empty could also cause the issue.
>
> Thanks,
> Shawn
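
For anyone who finds this thread later, here is a rough sketch of the cleanup
Shawn describes above, assuming a stock ZooKeeper 3.4.x install and an
ensemble reachable at zk1:2181,zk2:2181,zk3:2181 (the host list and the exact
buffer value are illustrative placeholders, not something from the thread;
adjust them to your own environment):

    # Run from the ZooKeeper install directory.
    # Start the ZooKeeper CLI with an enlarged jute.maxbuffer so the client
    # can read the oversized /overseer/queue node; CLIENT_JVMFLAGS is picked
    # up by the 3.4.x zkCli.sh wrapper script.
    CLIENT_JVMFLAGS="-Djute.maxbuffer=31000000" \
        bin/zkCli.sh -server zk1:2181,zk2:2181,zk3:2181

    # Inside the zkCli shell: check how large the queue is (numChildren in
    # the stat output), then remove it recursively. "rmr" is the 3.4.x
    # command; later ZooKeeper versions call it "deleteall".
    stat /overseer/queue
    rmr /overseer/queue

    # If the servers themselves reject the large packet, each ZK server may
    # also need a temporary restart with the same setting (for example via
    # SERVER_JVMFLAGS) before the rmr will succeed, as Shawn notes above.

    # Once the queue is gone, restart the ZK servers without jute.maxbuffer
    # and, if necessary, restart the Solr nodes a few at a time.

The 31000000 figure simply mirrors the value Shawn suggests; the only real
requirement is that it is comfortably larger than the packet length reported
in the log (len30829010).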