Hi Martin, I have the same behaviour that you are describing, with a very similar setup.
6 machines, ~50 shards with a replicationFactor of 2. The most critical issue, IMHO, is that failover doesn't work when one node is down and the other is in recovery mode. In the log I can see the leader election saying "My last state is recovering, I won't be the leader" on one node, while the other node says "ZooKeeper says I'm the leader, but internally I don't think so". I don't know yet why two or three replicas sometimes go into recovery mode at the same time.

I have a high indexing rate (>500 docs/s). I moved ZooKeeper to its own disk to make sure disk latency is not the problem (there is a short zoo.cfg sketch at the end of this mail showing what I mean).

In my case, adding another replica per shard is too expensive: my setup uses SSDs, and standing up a new replica needs more memory and resources.

--
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Tuesday, November 12, 2013 at 8:45 AM, Martin de Vries wrote:

> Hi,
>
> We have:
>
> Solr 4.5.1 - 5 servers
> 36 cores, 2 shards each, 2 servers per shard (every core is on 4
> servers)
> about 4.5 GB total data on disk per server
> 4 GB JVM memory per server, 3 GB average in use
> Zookeeper 3.3.5 - 3 servers (one shared with Solr)
> haproxy load balancing
>
> Our SolrCloud is very unstable. About once a week some cores go into
> recovery state or down state. Many timeouts occur and we have to restart
> servers to get them back to work. The failover doesn't work in many
> cases, because one server has the core in down state and the other in
> recovering state. Other cores work fine. When the cloud is stable I
> sometimes see log messages like:
> - shard update error StdNode:
>   http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException:
>   IOException occured when talking to server at:
>   http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
> - forwarding update to
>   http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed -
>   retrying ...
> - null:ClientAbortException: java.io.IOException: Broken pipe
>
> Before the cloud problems start there are many large QTimes in the
> log (sometimes over 50 seconds), but there are no other errors until the
> recovery problems start.
>
> Any clue about what can be wrong?
>
> Kind regards,
>
> Martin
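
One thing that may help with debugging: you can see exactly what ZooKeeper believes about each shard's leader and the state of every replica by dumping the cluster state that Solr 4.x keeps in ZooKeeper. A sketch using the stock zkCli.sh that ships with ZooKeeper (the hostname and port are examples, adjust to your ensemble):

    zkCli.sh -server zk1.example.com:2181 get /clusterstate.json

Every replica in that JSON carries a "state" field (active, recovering, down), so you can compare what ZooKeeper reports with what the node itself is logging when they disagree about the leader.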
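
For what it's worth, this is roughly what I mean by giving ZooKeeper its own disk - a minimal zoo.cfg sketch, with made-up paths and server names; the point is that dataDir and especially dataLogDir live on a disk that Solr never touches:

    # zoo.cfg - keep the ZooKeeper transaction log off the Solr index disks
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/zk/data
    # the txn log is fsync'd on every write, so it benefits the most
    # from a dedicated disk
    dataLogDir=/zk/txlog
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888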
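
Martin, about the 50+ second QTimes you see before the trouble starts: if those are caused by garbage collection pauses, any pause longer than the ZooKeeper session timeout will make ZooKeeper expire the node's session, and the node then drops into recovery - which would fit replicas going into recovery at the same time. It may be worth checking zkClientTimeout in solr.xml. A sketch of the relevant section as it looks in Solr 4.5, other elements omitted (15000 ms is the shipped default; raising it only hides long pauses, so checking the GC logs is the real fix):

    <solr>
      <solrcloud>
        <!-- ZooKeeper session timeout: a GC pause longer than this
             expires the session and forces the node into recovery -->
        <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
      </solrcloud>
    </solr>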