Sorry for leaving that bit out. This is Solr 4.1.0.

Thanks again,
John
On Wed, Jan 23, 2013 at 5:39 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Hi,
>
> Solr4 is 4.0 or 4.1? If the former, try the latter first?
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
>
> On Jan 23, 2013 2:51 PM, "John Skopis (lists)" <jli...@skopis.com> wrote:
>
> > Hello,
> >
> > We have recently put Solr4 into production.
> >
> > We have a 3-node cluster with a single shard. Each Solr node is also a
> > ZooKeeper node, but ZooKeeper is running in cluster mode. We are using
> > the Cloudera ZooKeeper package.
> >
> > There are no communication problems between nodes. They are in two
> > different racks directly connected over a 2Gb uplink. The nodes each
> > have a 1Gb uplink.
> >
> > I was thinking ideally mmsolr01 would be the leader; the application
> > sends all index requests directly to the leader node. A load balancer
> > splits read requests over the remaining two nodes.
> >
> > We autocommit every 300s or 10k documents, with a soft commit every 5s.
> > The index is roughly 200mm documents.
> >
> > I have configured a cron job to run every hour (on every node):
> >
> > 0 * * * * /usr/bin/curl -s 'http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3' > /dev/null
> >
> > Taking a snapshot seems to be the easiest way to reproduce the problem,
> > but it's also possible to reproduce it under very heavy indexing load.
> >
> > When the snapshot is running, we occasionally get a ZooKeeper timeout,
> > causing the leader to drop out of the cluster. We have also seen a few
> > ZooKeeper timeouts when indexing load is very high.
> >
> > After the failure it can take the now-inconsistent node a few hours to
> > recover. After numerous failed recovery attempts the failed node seems
> > to sync up.
> >
> > I have attached a log file demonstrating this.
> >
> > We see lots of request timeouts, seemingly when the failed node tries
> > to sync up with the current leader by doing a full sync. This seems
> > wrong; there should be no reason for a timeout to happen here, should
> > there?
> >
> > I am able to manually copy the index using tar + netcat in a few
> > minutes. The replication handler takes:
> >
> > INFO: Total time taken for download : 3549 secs
> >
> > Why does it take so long to recover?
> >
> > Are we better off manually replicating the index?
> >
> > Much appreciated,
> > Thanks,
> > John
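For reference, the commit policy described above (hard commit every 300s or 10k documents, soft commit every 5s) would be expressed in solrconfig.xml roughly as follows. This is a sketch built from the figures quoted in the thread, not John's actual configuration, and openSearcher=false is an assumption (a common choice when soft commits handle search visibility):

    <!-- hard commit: every 10k docs or 300s, whichever comes first -->
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>300000</maxTime>  <!-- milliseconds -->
      <openSearcher>false</openSearcher>
    </autoCommit>

    <!-- soft commit: make new documents searchable every 5s -->
    <autoSoftCommit>
      <maxTime>5000</maxTime>
    </autoSoftCommit>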
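Since ZooKeeper session expiry is what drops the leader out of the cluster, one common mitigation in Solr 4.x is raising zkClientTimeout in solr.xml so that a slow node survives brief pauses (e.g. snapshot I/O or GC) without losing its session. A minimal sketch using the legacy solr.xml format; the 30-second value and the core layout are illustrative, not taken from the thread:

    <?xml version="1.0" encoding="UTF-8" ?>
    <solr persistent="true">
      <!-- zkClientTimeout raised from the 15s default to 30s -->
      <cores adminPath="/admin/cores" defaultCoreName="collection1"
             host="${host:}" hostPort="${jetty.port:8983}"
             zkClientTimeout="30000">
        <core name="collection1" instanceDir="collection1" />
      </cores>
    </solr>

Note that the effective session timeout is also capped by the ZooKeeper ensemble's maxSessionTimeout setting, so that may need raising as well.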
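For comparison, the manual tar + netcat copy John mentions might look something like the sketch below. The data paths, port, and hostname are assumptions, and the index should be quiesced (or a backup snapshot used) before copying:

    # On the recovering node (receiver), with Solr stopped
    # (some netcat builds need 'nc -l -p 9999' instead):
    nc -l 9999 | tar xf - -C /var/solr/collection1/data

    # On a healthy node (sender), streaming the index directory:
    tar cf - -C /var/solr/collection1/data index | nc recovering-node 9999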