Actually, I was mistaken: I thought we were running 4.1.0, but we were actually running 4.0.0.
I will upgrade to 4.1.0 and see if this is still happening.

Thanks,
John

On Wed, Jan 23, 2013 at 9:39 PM, John Skopis (lists) <jli...@skopis.com> wrote:
> Sorry for leaving that bit out. This is Solr 4.1.0.
>
> Thanks again,
> John
>
> On Wed, Jan 23, 2013 at 5:39 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
>> Hi,
>>
>> Solr4 is 4.0 or 4.1? If the former, try the latter first?
>>
>> Otis
>> Solr & ElasticSearch Support
>> http://sematext.com/
>> On Jan 23, 2013 2:51 PM, "John Skopis (lists)" <jli...@skopis.com> wrote:
>>
>> > Hello,
>> >
>> > We have recently put Solr 4 into production.
>> >
>> > We have a 3-node cluster with a single shard. Each Solr node is also a
>> > ZooKeeper node, but ZooKeeper is running in cluster mode. We are using
>> > the Cloudera ZooKeeper package.
>> >
>> > There are no communication problems between nodes. They are in two
>> > different racks directly connected over a 2Gb uplink. The nodes each
>> > have a 1Gb uplink.
>> >
>> > Ideally mmsolr01 would be the leader: the application sends all index
>> > requests directly to the leader node, and a load balancer splits read
>> > requests over the remaining two nodes.
>> >
>> > We autocommit every 300s or 10k documents, with a softcommit every 5s.
>> > The index is roughly 200mm documents.
>> >
>> > I have configured a cron to run every hour (on every node):
>> >
>> > 0 * * * * /usr/bin/curl -s 'http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3' > /dev/null
>> >
>> > Using a snapshot seems to be the easiest way to reproduce, but it's
>> > also possible to reproduce under very heavy indexing load.
>> >
>> > When the snapshot is running, we occasionally get a zk timeout, causing
>> > the leader to drop out of the cluster. We have also seen a few zk
>> > timeouts when indexing load is very high.
>> >
>> > After the failure it can take the now-inconsistent node a few hours to
>> > recover. After numerous failed recovery attempts the failed node
>> > eventually syncs up.
>> >
>> > I have attached a log file demonstrating this.
>> >
>> > We see lots of timed-out requests, seemingly when the failed node tries
>> > to sync up with the current leader by doing a full sync. This seems
>> > wrong; there should be no reason for a timeout to happen here.
>> >
>> > I am able to manually copy the index using tar + netcat in a few
>> > minutes. The replication handler takes:
>> >
>> > INFO: Total time taken for download : 3549 secs
>> >
>> > Why does it take so long to recover?
>> >
>> > Are we better off manually replicating the index?
>> >
>> > Much appreciated,
>> > Thanks,
>> > John
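For reference, the commit cadence quoted above maps to something like the following in solrconfig.xml. This is an illustrative sketch rather than a paste from our actual config, and openSearcher=false is an assumption on my part (visibility would come from the soft commit):

<!-- hard commit: flush to stable storage every 300s or 10k docs -->
<autoCommit>
  <maxTime>300000</maxTime>
  <maxDocs>10000</maxDocs>
  <!-- assumed: don't open a new searcher on hard commit; soft commits handle visibility -->
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- soft commit: make new documents searchable every 5s -->
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>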
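For anyone curious, the manual tar + netcat copy mentioned above is nothing fancier than roughly this. The host name, port, and data path are placeholders, the exact nc flags depend on your netcat flavor, and you would want to copy from a snapshot (or an otherwise quiesced index) so the files don't change mid-stream:

# on the recovering node: listen and unpack the incoming index
nc -l 9999 | tar xf - -C /var/solr/data/collection1/data/

# on a healthy node: stream the index directory across the wire
tar cf - -C /var/solr/data/collection1/data/ index | nc recovering-node 9999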