We did some more monitoring and have some new information: 

Before the issue happens, the garbage collector's "collection count"
increases a lot. The increase seems to start about an hour before the
real problem occurs:

http://www.analyticsforapplications.com/GC.png [1] 
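
For anyone who wants to watch the same counter without a monitoring
tool, here is a minimal sketch that reads it through the standard
java.lang.management API. The class name GcCountSampler and the
ratePerMinute helper are made up for illustration, and the sketch only
samples the JVM it runs in. Watching a Solr server this way would
require either running it inside that JVM or attaching over remote JMX,
which is not shown here. It avoids Java 7 syntax so it should also
compile on Java 1.6.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public class GcCountSampler {

    /** Cumulative collection count per collector of the current JVM. */
    public static Map<String, Long> snapshot() {
        Map<String, Long> counts = new LinkedHashMap<String, Long>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() returns -1 when the collector does not report it
            counts.put(gc.getName(), gc.getCollectionCount());
        }
        return counts;
    }

    /** Collections per minute between two cumulative counts; 0 for unusable input. */
    public static double ratePerMinute(long before, long after, long elapsedMillis) {
        if (elapsedMillis <= 0 || before < 0 || after < before) {
            return 0.0;
        }
        return (after - before) * 60000.0 / elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical defaults: sample every second, five times
        long intervalMillis = args.length > 0 ? Long.parseLong(args[0]) : 1000L;
        int samples = args.length > 1 ? Integer.parseInt(args[1]) : 5;

        Map<String, Long> previous = snapshot();
        long previousTime = System.currentTimeMillis();
        for (int i = 0; i < samples; i++) {
            Thread.sleep(intervalMillis);
            Map<String, Long> current = snapshot();
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Long> e : current.entrySet()) {
                Long before = previous.get(e.getKey());
                double rate = before == null
                        ? 0.0
                        : ratePerMinute(before, e.getValue(), now - previousTime);
                System.out.println(e.getKey() + ": count=" + e.getValue()
                        + ", collections/min=" + rate);
            }
            previous = current;
            previousTime = now;
        }
    }
}
```

A steadily climbing collections/min value for the old-generation
collector would match the pattern in the graph. The collector names
that get printed also show which garbage collector is actually active.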

We tried both the G1 garbage collector and the regular one; the problem
happens with both of them.

We use Java 1.6 on some servers. Will Java 1.7 be better?

Martin 

Martin de Vries wrote on 12.11.2013 10:45:

> Hi,
> 
> We have:
> 
> Solr 4.5.1 - 5 servers
> 36 cores, 2 shards each, 2 servers per shard (every core is on 4 servers)
> about 4.5 GB total data on disk per server
> 4GB JVM-Memory per server, 3GB average in use
> Zookeeper 3.3.5 - 3 servers (one shared with Solr)
> haproxy load balancing
> 
> Our SolrCloud is very unstable. About once a week some cores go into
> recovery state or down state. Many timeouts occur and we have to
> restart servers to get them working again. The failover doesn't work
> in many cases, because one server has the core in down state and the
> other has it in recovering state. Other cores work fine. When the
> cloud is stable I sometimes see log messages like:
> - shard update error StdNode:
>   http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException:
>   IOException occured when talking to server at:
>   http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
> - forwarding update to
>   http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed -
>   retrying ...
> - null:ClientAbortException: java.io.IOException: Broken pipe
> 
> Before the cloud problems start there are many large QTimes in the
> log (sometimes over 50 seconds), but there are no other errors until
> the recovery problems start.
> 
> Any clue about what can be wrong?
> 
> Kind regards,
> 
> Martin

 

Links:
------
[1] http://www.analyticsforapplications.com/GC.png
