Hello,

I’m seeing a similar issue, but with much smaller indexes and with much higher 
disk latency during backup sessions on our NFS. I have a feeling the solution 
could be the same, so I’ll leave my story here just in case. No solution found 
yet: 
http://lucene.472066.n3.nabble.com/SolrCloud-never-fully-recovers-after-slow-disks-td4099350.html

--
Henrik Ossipoff Hansen
Developer, Entertainment Trading


On 12 Nov 2013 at 09:47:01, Martin de Vries 
(mar...@downnotifier.com) wrote:

Hi,

We have:

Solr 4.5.1 - 5 servers
36 cores, 2 shards each, 2 servers per shard (every core is on 4
servers)
about 4.5 GB total data on disk per server
4GB JVM-Memory per server, 3GB average in use
Zookeeper 3.3.5 - 3 servers (one shared with Solr)
haproxy load balancing

Our SolrCloud is very unstable. About once a week some cores go into
recovery state or down state. Many timeouts occur and we have to restart
servers to get them working again. Failover doesn't work in many
cases, because one server has the core in down state and the other in
recovering state. Other cores work fine. When the cloud is stable I
sometimes see log messages like:
- shard update error StdNode:
http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException:
IOException occured when talking to server at:
http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
- forwarding update to
http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed -
retrying ...
- null:ClientAbortException: java.io.IOException: Broken pipe

Before the cloud problems start there are many large QTimes in the
log (sometimes over 50 seconds), but there are no other errors until the
recovery problems start.
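
One way to spot such slow requests ahead of a recovery storm is to scan the
request log for large QTime values. The following is only a minimal sketch,
assuming the log lines carry Solr's usual QTime=<milliseconds> field; the
function name slow_requests and the 10-second default threshold are
illustrative choices, not part of the setup described here.

```python
import re

# Solr request log lines normally end with a "QTime=<milliseconds>" field.
QTIME_RE = re.compile(r"\bQTime=(\d+)\b")


def slow_requests(lines, threshold_ms=10_000):
    """Return (qtime_ms, line) pairs for lines whose QTime exceeds threshold_ms.

    `lines` can be any iterable of strings, e.g. an open log file.
    The threshold default is an arbitrary example value.
    """
    slow = []
    for line in lines:
        match = QTIME_RE.search(line)
        if match is None:
            continue  # not a request line, or no QTime field present
        qtime_ms = int(match.group(1))
        if qtime_ms > threshold_ms:
            slow.append((qtime_ms, line.rstrip("\n")))
    return slow
```

Passing an open log file handle to slow_requests and printing the results
grouped by hour should show whether the slow queries cluster around a
particular time of day, such as a backup window.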


Any clue about what can be wrong?


Kind regards,

Martin
