Hello,

We have recently put Solr 4 into production.

We have a 3-node cluster with a single shard. Each Solr node is also a ZooKeeper node, but ZooKeeper is running in cluster mode; we are using the Cloudera ZooKeeper package. There are no communication problems between the nodes: they are in two different racks directly connected over a 2Gb uplink, and each node has a 1Gb uplink. Ideally mmsolr01 would be the leader; the application sends all index requests directly to the leader node, and a load balancer splits read requests over the remaining two nodes. We autocommit every 300s or 10k documents, with a softcommit every 5s. The index is roughly 200 million documents.

I have configured a cron job to run every hour (on every node):

0 * * * * /usr/bin/curl -s 'http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3' > /dev/null

Running a snapshot seems to be the easiest way to reproduce the problem, but it is also possible to reproduce it under very heavy indexing load. While the snapshot is running, we occasionally get a ZooKeeper timeout, causing the leader to drop out of the cluster. We have also seen a few ZK timeouts when indexing load is very high. After the failure it can take the now-inconsistent node a few hours to recover; only after numerous failed recovery attempts does the failed node finally sync up. I have attached a log file demonstrating this.

We see lots of timed-out requests, seemingly when the failed node tries to sync up with the current leader by doing a full sync. This seems wrong: there should be no reason for a timeout to happen here. I am able to manually copy the index using tar + netcat in a few minutes, while the replication handler takes:

INFO: Total time taken for download : 3549 secs

Why does it take so long to recover? Are we better off manually replicating the index?

Much appreciated,

Thanks,
John
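P.S. For reference, our commit settings in solrconfig.xml look roughly like the fragment below. I am paraphrasing from memory rather than pasting our exact file, so treat it as a sketch of the 300s / 10k-doc hard commit and 5s soft commit described above:

```xml
<!-- Sketch of the updateHandler section of solrconfig.xml; values match
     the settings described in the mail, not an exact copy of our config. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>300000</maxTime>          <!-- hard commit every 300s -->
    <maxDocs>10000</maxDocs>           <!-- or every 10k documents -->
    <openSearcher>false</openSearcher> <!-- flush to disk without reopening searchers -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>            <!-- soft commit every 5s for visibility -->
  </autoSoftCommit>
</updateHandler>
```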
sample.txt.gz
Description: GNU Zip compressed data