On 10/24/2017 8:11 AM, Tarjono, C. A. wrote:
> Would like to check if anyone have seen this issue before, we started
> having this a few days ago:
>
> The only error I can see in solr console is below:
>
> 5960847[main-SendThread(172.16.130.132:2281)] WARN
> org.apache.zookeeper.ClientCnxn [ ] – Session 0x65f4e28b7370001 for
> server 172.16.130.132/172.16.130.132:2281, unexpected error, closing
> socket connection and attempting reconnect java.io.IOException: Packet
> len30829010 is out of range!
Combining the last part of what I quoted above with the image you shared later, I am pretty sure I know what is happening: the overseer queue in zookeeper (at the ZK path /overseer/queue) has a very large number of entries in it. Based on the fact that you are seeing a packet length beyond 30 million bytes, I am betting the queue holds somewhere between 1.5 million and 2 million entries. ZK cannot handle a packet that large without a special startup argument; the parameter involved (jute.maxbuffer) defaults to a little over one million bytes.

To fix this, you're going to need to wipe out the overseer queue. ZK includes a script named ZkCli. Note that Solr also includes a script called zkcli, which does very different things; you need the one included with zookeeper.

Wiping out the queue when it is that large is not straightforward. You need to start zookeeper's ZkCli script with a -Djute.maxbuffer=31000000 argument and the same zkHost value used by Solr, then use a command like "rmr /overseer/queue" in that command shell to completely remove the /overseer/queue path. There is a rough sketch of the commands at the end of this message. Running this procedure might also require temporarily restarting the ZK servers with the same jute.maxbuffer argument, though I am not sure that is required; if it is, restart them again without the setting once the queue has been removed. You may also need to restart Solr afterward.

The basic underlying problem here is that ZK allows new child nodes to be added even after the parent node's list of children has grown past the default buffer size, at which point clients can no longer read that list. The issue is documented here:

https://issues.apache.org/jira/browse/ZOOKEEPER-1162

I can't be sure why your cloud is adding so many entries to the overseer queue. I have seen this happen when a server in the cloud is restarted, particularly when the cloud has a large number of collections or shard replicas. Restarting multiple servers, or restarting the same server several times, without waiting for the overseer queue to empty in between could also cause it.

Thanks,
Shawn
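P.S. Here is a rough sketch of the ZkCli procedure described above, not an exact recipe. It assumes the zkCli.sh script shipped in the zookeeper distribution's bin directory and the CLIENT_JVMFLAGS variable that its startup scripts read; the host:port is copied from your log, so substitute whatever your zkHost actually contains, and recent ZK versions use "deleteall" instead of "rmr":

  # raise the client-side buffer so the CLI can read/remove the huge node
  export CLIENT_JVMFLAGS="-Djute.maxbuffer=31000000"
  bin/zkCli.sh -server 172.16.130.132:2281

  # inside the ZK command shell:
  stat /overseer/queue    # numChildren shows how many entries are queued
  rmr /overseer/queue     # recursively removes the path and everything under it

The stat command only returns the node's metadata, not the child list, so it should work even without the larger buffer. If the ZK servers themselves turn out to need the bigger buffer, SERVER_JVMFLAGS is the equivalent knob for zkServer.sh.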