Sorry for leaving that bit out. This is Solr 4.1.0.

Thanks again,
John

On Wed, Jan 23, 2013 at 5:39 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> Is Solr4 4.0 or 4.1? If the former, try the latter first?
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Jan 23, 2013 2:51 PM, "John Skopis (lists)" <jli...@skopis.com> wrote:
>
> > Hello,
> >
> > We have recently put solr4 into production.
> >
> > We have a 3-node cluster with a single shard. Each Solr node is also a
> > ZooKeeper node, but ZooKeeper is running in cluster mode (not embedded).
> > We are using the Cloudera ZooKeeper package.
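> >
> > For reference, the ensemble config on each node looks roughly like this
> > (a sketch; dataDir and the mmsolr02/mmsolr03 hostnames are placeholders
> > for the other two nodes):
> >
> > # /etc/zookeeper/zoo.cfg -- 3-node ensemble, stock ports
> > tickTime=2000
> > initLimit=10
> > syncLimit=5
> > dataDir=/var/lib/zookeeper
> > clientPort=2181
> > server.1=mmsolr01:2888:3888
> > server.2=mmsolr02:2888:3888
> > server.3=mmsolr03:2888:3888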
> >
> > There are no communication problems between nodes. They are in two
> > different racks directly connected over a 2Gb uplink; the nodes each
> > have a 1Gb uplink.
> >
> > I was thinking ideally mmsolr01 would be the leader: the application
> > sends all index requests directly to the leader node, and a load
> > balancer splits read requests over the remaining two nodes.
> >
> > We autocommit every 300s or 10k documents, with a softcommit every 5s.
> > The index is roughly 200 million documents.
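> >
> > The relevant bits of solrconfig.xml look roughly like this (a sketch of
> > what the values above translate to, using the stock 4.x elements):
> >
> > <autoCommit>
> >   <maxTime>300000</maxTime>          <!-- hard commit every 300s -->
> >   <maxDocs>10000</maxDocs>           <!-- ...or every 10k documents -->
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> > <autoSoftCommit>
> >   <maxTime>5000</maxTime>            <!-- soft commit every 5s -->
> > </autoSoftCommit>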
> >
> > I have configured a cron to run every hour (on every node):
> > 0 * * * * /usr/bin/curl -s 'http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3' > /dev/null
> >
> > Using a snapshot seems to be the easiest way to reproduce, but it's also
> > possible to reproduce under very heavy indexing load.
> >
> > When the snapshot is running, occasionally we get a zk timeout, causing
> > the leader to drop out of the cluster. We have also seen a few zk
> > timeouts when index load is very high.
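> >
> > (The timeout involved is the zkClientTimeout that Solr reads from
> > solr.xml; a sketch of the relevant attribute, assuming the stock legacy
> > 4.x layout and its 15s default:)
> >
> > <cores adminPath="/admin/cores" defaultCoreName="collection1"
> >        hostPort="${jetty.port:8983}"
> >        zkClientTimeout="${zkClientTimeout:15000}">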
> >
> > After the failure it can take the now-inconsistent node a few hours to
> > recover; only after numerous failed recovery attempts does it finally
> > sync up.
> >
> > I have attached a log file demonstrating this.
> >
> > We see lots of timeout requests, seemingly when the failed node tries
> > to sync up with the current leader by doing a full sync. This seems
> > wrong; there should be no reason for a timeout to happen here.
> >
> > I am able to manually copy the index using tar + netcat in a few
> > minutes. The replication handler, by contrast, takes nearly an hour:
> >
> > INFO: Total time taken for download : 3549 secs
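> >
> > (Roughly what the manual copy looks like; the port, hostname, and index
> > path are placeholders, and the nc flags vary by netcat flavor:)
> >
> > # on the recovering node: listen and unpack into the index directory
> > nc -l 9999 | tar xf - -C /var/solr/collection1/data/index
> >
> > # on the leader: stream the index directory over
> > tar cf - -C /var/solr/collection1/data/index . | nc mmsolr02 9999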
> >
> > Why does it take so long to recover?
> >
> > Are we better off manually replicating the index?
> >
> > Much appreciated,
> > Thanks,
> > John
> >
>
