I have 3 different SolrCloud clusters that share a single set (3) of zookeeper 
servers. Each SolrCloud cluster has its own set of collections stored on 
Zookeeper. Twice in the past week all 3 clusters have had about a 1 minute 
period where all requests stopped coming in. Solr recovers and further requests 
are fine.

Inbound connections to the SolrCloud clusters are via load balancers. One of 
the clusters uses the /Ping request handler, the other two use TCP checks on 
port 8983 (yes, this is not ideal). The logs from the 2 clusters that use TCP 
for health check don't have anything notable in the logs except a lack of 
inbound requests. The logs from the cluster that has the health check to the 
Ping handler shows something odd at the time of the issue:

I cleaned out all requests and mutl-line errors and sorted alphabetically to 
line up the timestamps. You can see the Ping requests come in every 5 seconds 
then the Ping response for 9 of them seem to queue up inside of Solr which then 
causes the VMs to come out of the load balancer and stop taking requests:

INFO  - 2017-04-03 15:08:03.255; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:03.255; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=0
INFO  - 2017-04-03 15:08:08.256; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:13.254; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:18.247; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:23.250; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:28.256; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:33.263; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:38.274; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:43.266; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:48.279; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:53.289; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:54.188; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=10922
INFO  - 2017-04-03 15:08:54.188; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=15914
INFO  - 2017-04-03 15:08:54.188; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=20925
INFO  - 2017-04-03 15:08:54.188; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=25932
INFO  - 2017-04-03 15:08:54.188; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=30938
INFO  - 2017-04-03 15:08:54.188; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=35941
INFO  - 2017-04-03 15:08:54.188; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=40934
INFO  - 2017-04-03 15:08:54.188; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=5909
INFO  - 2017-04-03 15:08:54.188; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=899
INFO  - 2017-04-03 15:08:54.204; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=45948
INFO  - 2017-04-03 15:08:54.450; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.servlet.SolrDispatchFilter; Unable to 
write response, client closed connection or we are shutting down
INFO  - 2017-04-03 15:08:54.450; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.servlet.SolrDispatchFilter; Unable to 
write response, client closed connection or we are shutting down
INFO  - 2017-04-03 15:08:54.450; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.servlet.SolrDispatchFilter; Unable to 
write response, client closed connection or we are shutting down
INFO  - 2017-04-03 15:08:54.450; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.servlet.SolrDispatchFilter; Unable to 
write response, client closed connection or we are shutting down
INFO  - 2017-04-03 15:08:54.450; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.servlet.SolrDispatchFilter; Unable to 
write response, client closed connection or we are shutting down
INFO  - 2017-04-03 15:08:54.450; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.servlet.SolrDispatchFilter; Unable to 
write response, client closed connection or we are shutting down
INFO  - 2017-04-03 15:08:54.450; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.servlet.SolrDispatchFilter; Unable to 
write response, client closed connection or we are shutting down
INFO  - 2017-04-03 15:08:54.450; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.servlet.SolrDispatchFilter; Unable to 
write response, client closed connection or we are shutting down
INFO  - 2017-04-03 15:08:54.450; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.servlet.SolrDispatchFilter; Unable to 
write response, client closed connection or we are shutting down
INFO  - 2017-04-03 15:08:59.207; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:08:59.207; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=0
INFO  - 2017-04-03 15:09:04.215; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:09:04.229; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=14
INFO  - 2017-04-03 15:09:09.229; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} hits=272027 
status=0 QTime=0
INFO  - 2017-04-03 15:09:09.229; [products shard1 core_node1 
products_shard1_replica1] org.apache.solr.core.SolrCore; 
[products_shard1_replica1] webapp=/solr path=/admin/ping params={} status=0 
QTime=0

I can provide more log entries before and after showing commits, requests and 
index updates if needed. The same pattern can be seen across all 8 VMs though 
at roughly the same time. This feels like it could be load balancer or network 
related but wanted to run it by everyone to see if this potentially could be 
something on Solr or Zookeeper side.

The ping request handler is:

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
   <lst name="invariants">
     <str name="qt">/select</str><!-- handler to delegate to -->
     <str name="q">*:*</str>
   </lst>
</requestHandler>

Thanks,
Will

Reply via email to