Later versions of Solr have changed in two ways:
1> fewer items are put in the overseer queue in the first place, and
2> the messages that do get there are processed much more quickly.

Meanwhile, my guess is you have a lot of replicas out there. I've seen
this happen when there are lots of collections and/or replicas and
people try to start many of them up at once. One strategy to get by is
to start your Solr nodes a few at a time, wait for the Overseer queue
to get processed, then start a few more. Unsatisfactory, but if the
trigger was starting all your Solr instances at once and you have a
lot of replicas, it may help until you can upgrade.
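
If it helps, one way to watch the queue drain between batches is
ZooKeeper's own zkCli.sh (the address below is just the ensemble member
from your log; any member will do):

    bin/zkCli.sh -server 172.16.130.132:2281
    [zk: 172.16.130.132:2281(CONNECTED) 0] stat /overseer/queue

The numChildren field in the stat output is the number of pending
overseer messages, and stat works even when the queue is too large to
list, so you can wait for it to drop before starting the next few nodes.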

Best,
Erick

On Wed, Oct 25, 2017 at 5:44 PM, Tarjono, C. A.
<c.a.tarj...@accenture.com> wrote:
> @Shawn Heisey,
>
> Thanks so much for your input! We will try your suggestion and hope it will 
> resolve the issue.
>
> On a side note, would you know if this is an existing bug? If so, has it 
> been resolved in a later version? i.e. ZK allowing nodes to be added even 
> when the parent exceeds the buffer size.
>
> We are currently using ZK 3.4.6 with SolrCloud 5.1.0.
>
> Thanks again!
>
> Best Regards,
>
> Christopher Tarjono
> Accenture Pte Ltd
>
> +65 9347 2484
> c.a.tarj...@accenture.com
> ________________________________
> From: Shawn Heisey <apa...@elyograg.org>
> Sent: 25 October 2017 20:57:30
> To: solr-user@lucene.apache.org
> Subject: [External] Re: SolrCloud not able to view cloud page - Loading of 
> "/solr/zookeeper?wt=json" failed (HTTP-Status 500)
>
> On 10/24/2017 8:11 AM, Tarjono, C. A. wrote:
>> Would like to check if anyone have seen this issue before, we started
>> having this a few days ago:
>>
>>
>>
>> The only error I can see in solr console is below:
>>
>> 5960847[main-SendThread(172.16.130.132:2281)] WARN
>> org.apache.zookeeper.ClientCnxn [ ] – Session 0x65f4e28b7370001 for
>> server 172.16.130.132/172.16.130.132:2281, unexpected error, closing
>> socket connection and attempting reconnect java.io.IOException: Packet
>> len30829010 is out of range!
>>
>
> Combining the last part of what I quoted above with the image you shared
> later, I am pretty sure I know what is happening.
>
> The overseer queue in zookeeper (at the ZK path of /overseer/queue) has
> a lot of entries in it.  Based on the fact that you are seeing a packet
> length beyond 30 million bytes, I am betting that the number of entries
> in the queue is between 1.5 million and 2 million.  ZK cannot handle
> that packet size without a special startup argument.  The value of the
> special parameter defaults to a little over one million bytes.
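>
> (Rough arithmetic behind that estimate: each queue entry shows up in
> the children listing as a short name, something like "qn-0000012345",
> plus a few bytes of per-entry overhead, call it 15-20 bytes apiece, so
> a 30,829,010-byte packet works out to roughly 1.5 to 2 million entries.
> The jute.maxbuffer default is 0xfffff, i.e. 1,048,575 bytes.)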
>
> To fix this, you're going to need to wipe out the overseer queue.  ZK
> includes a script named ZkCli.  Note that Solr includes a script called
> zkcli as well, which does very different things.  You need the one
> included with zookeeper.
>
> Wiping out the queue when it is that large is not straightforward.  You
> need to start the ZkCli script included with zookeeper with a
> -Djute.maxbuffer=31000000 argument and the same zkHost value used by
> Solr, and then use a command like "rmr /overseer/queue" in that command
> shell to completely remove the /overseer/queue path.  Then you can
> restart the ZK servers without the jute.maxbuffer setting.  You may need
> to restart Solr.  Running this procedure might also require temporarily
> restarting the ZK servers with the same jute.maxbuffer argument, but I
> am not sure whether that is required.
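>
> As a concrete sketch (the install path below is an example; the address
> is the ensemble member from your log), the ZkCli invocation would look
> roughly like this:
>
>     CLIENT_JVMFLAGS="-Djute.maxbuffer=31000000" \
>         /opt/zookeeper/bin/zkCli.sh -server 172.16.130.132:2281
>     [zk: 172.16.130.132:2281(CONNECTED) 0] rmr /overseer/queue
>
> CLIENT_JVMFLAGS is how zkCli.sh picks up extra JVM system properties,
> and rmr deletes the given node along with everything under it.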
>
> The basic underlying problem here is that ZK allows adding new nodes
> even when the size of the parent node exceeds the default buffer size.
> That issue is documented here:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-1162
>
> I can't be sure why your cloud is adding so many entries to the
> overseer queue.  I have seen this problem happen when restarting a
> server in the cloud, particularly when there are a large number of
> collections or shard replicas in the cloud.  Restarting multiple servers
> or restarting the same server multiple times without waiting for the
> overseer queue to empty could also cause the issue.
>
> Thanks,
> Shawn
>
>
