Hi Martin,

I see the same behaviour you are describing, with a pretty similar setup.

6 machines, ~50 shards with a replicationFactor of two.
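
For reference, a collection like that would typically be created with the 
Collections API; the collection name, host and maxShardsPerNode below are 
only placeholders, not my real values:

  curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=50&replicationFactor=2&maxShardsPerNode=20"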

The most critical issue IMHO is that failover doesn't work when one node is 
down and the other is in recovery mode.

In the log I can see exceptions saying things like "my last state is 
recovering, I won't be the leader", while the other node says "ZooKeeper 
says I'm the leader, but internally I don't think so".
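
When that happens, something like this can show which replica ZooKeeper 
currently lists as leader (host, port and chroot are placeholders, adjust 
them to your environment):

  # dump the cluster state stored in ZooKeeper; prepend your chroot
  # (e.g. /solr) to the path if you use one
  zkCli.sh -server localhost:2181 get /clusterstate.json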

I don't know yet why in some situations 2 or 3 replicas go into recovery mode 
at the same time. I have a high index rate (>500 docs/s).

I moved ZooKeeper to its own disk to make sure that latency is not the problem.
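
In case it's useful, that basically means pointing the zoo.cfg data 
directories at the dedicated disk; the paths here are just placeholders:

  # zoo.cfg (paths are placeholders)
  # snapshot directory
  dataDir=/mnt/zk/data
  # transaction log directory (the latency-sensitive part)
  dataLogDir=/mnt/zk/datalog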

In my case, adding another replica per shard is too expensive: my setup uses 
SSDs, and setting up a new replica means more memory and resources.


-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, November 12, 2013 at 8:45 AM, Martin de Vries wrote:

> Hi,
> 
> We have:
> 
> Solr 4.5.1 - 5 servers
> 36 cores, 2 shards each, 2 servers per shard (every core is on 4 
> servers)
> about 4.5 GB total data on disk per server
> 4GB JVM-Memory per server, 3GB average in use
> Zookeeper 3.3.5 - 3 servers (one shared with Solr)
> haproxy load balancing
> 
> Our SolrCloud is very unstable. About once a week some cores go into 
> recovery state or down state. Many timeouts occur and we have to restart 
> servers to get them back to work. The failover doesn't work in many 
> cases, because one server has the core in down state, the other in 
> recovering state. Other cores work fine. When the cloud is stable I 
> sometimes see log messages like:
> - shard update error StdNode: 
> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException: 
> IOException occured when talking to server at: 
> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
> - forwarding update to 
> http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed - 
> retrying ...
> - null:ClientAbortException: java.io.IOException: Broken pipe
> 
> Before the cloud problems start there are many large QTimes in the 
> log (sometimes over 50 seconds), but there are no other errors until the 
> recovery problems start.
> 
> 
> Any clue about what can be wrong?
> 
> 
> Kind regards,
> 
> Martin 
