On 10/24/2017 8:11 AM, Tarjono, C. A. wrote:
> Would like to check if anyone have seen this issue before, we started
> having this a few days ago:
>
> The only error I can see in solr console is below:
>
> 5960847[main-SendThread(172.16.130.132:2281)] WARN
> org.apache.zookeeper.ClientCnxn [ ] – Session 0x65f4e28b7370001 for
> server 172.16.130.132/172.16.130.132:2281, unexpected error, closing
> socket connection and attempting reconnect java.io.IOException: Packet
> len30829010 is out of range!
Combining the last part of what I quoted above with the image you shared later, I am pretty sure I know what is happening: the overseer queue in zookeeper (at the ZK path /overseer/queue) has a very large number of entries in it. Based on the fact that you are seeing a packet length beyond 30 million bytes, I am betting the queue holds somewhere between 1.5 million and 2 million entries. ZK cannot handle a packet that large without a special startup argument; the parameter involved (jute.maxbuffer) defaults to a little over one million bytes.

To fix this, you're going to need to wipe out the overseer queue. ZK includes a script named ZkCli. Note that Solr also includes a script called zkcli, which does very different things; you need the one included with zookeeper.

Wiping out the queue when it is that large is not straightforward. You need to start zookeeper's ZkCli script with a -Djute.maxbuffer=31000000 argument and the same zkHost value used by Solr, then use a command like "rmr /overseer/queue" in that command shell to completely remove the /overseer/queue path. There is a rough sketch of the commands at the end of this message. Running this procedure might also require temporarily restarting the ZK servers with the same jute.maxbuffer argument, though I am not sure that is required; if it is, restart them again without the setting once the queue has been removed. You may also need to restart Solr afterward.

The basic underlying problem here is that ZK allows new child nodes to be added even after the parent node's list of children has grown past the default buffer size, at which point clients can no longer read that list. The issue is documented here:

https://issues.apache.org/jira/browse/ZOOKEEPER-1162

I can't be sure why your cloud is adding so many entries to the overseer queue. I have seen this happen when a server in the cloud is restarted, particularly when the cloud has a large number of collections or shard replicas. Restarting multiple servers, or restarting the same server several times, without waiting for the overseer queue to empty in between could also cause it.

Thanks,
Shawn
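P.S. Here is a rough sketch of the ZkCli procedure described above, not an exact recipe. It assumes the zkCli.sh script shipped in the zookeeper distribution's bin directory and the CLIENT_JVMFLAGS variable that its startup scripts read; the host:port is copied from your log, so substitute whatever your zkHost actually contains, and recent ZK versions use "deleteall" instead of "rmr":

  # raise the client-side buffer so the CLI can read/remove the huge node
  export CLIENT_JVMFLAGS="-Djute.maxbuffer=31000000"
  bin/zkCli.sh -server 172.16.130.132:2281

  # inside the ZK command shell:
  stat /overseer/queue    # numChildren shows how many entries are queued
  rmr /overseer/queue     # recursively removes the path and everything under it

The stat command only returns the node's metadata, not the child list, so it should work even without the larger buffer. If the ZK servers themselves turn out to need the bigger buffer, SERVER_JVMFLAGS is the equivalent knob for zkServer.sh.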