Re: Solr4 SolrCloud ClusterState says we are the leader, but locally we don't think so

Otis Gospodnetic Wed, 23 Jan 2013 17:40:08 -0800

Hi,

Is that Solr 4.0 or 4.1? If the former, try the latter first.


Otis
Solr & ElasticSearch Support
http://sematext.com/
On Jan 23, 2013 2:51 PM, "John Skopis (lists)" <jli...@skopis.com> wrote:

> Hello,
>
> We have recently put solr4 into production.
>
> We have a 3 node cluster with a single shard. Each solr node is also a
> zookeeper node, but zookeeper is running in cluster mode. We are using the
> cloudera zookeeper package.
>
> There are no communication problems between nodes. They are in two
> different racks directly connected over a 2Gb uplink. The nodes each have a
> 1Gb uplink.
>
> I was thinking ideally mmsolr01 would be the leader; the application sends
> all index requests directly to the leader node. A load balancer splits read
> requests over the remaining two nodes.
>
> We autocommit every 300s or 10k documents, with a soft commit every 5s. The
> index is roughly 200 million documents.
>
> I have configured a cron to run every hour (on every node):
> 0 * * * * /usr/bin/curl -s 'http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3' > /dev/null
>
> Using a snapshot seems to be the easiest way to reproduce, but it's also
> possible to reproduce under very heavy indexing load.
>
> When the snapshot is running, occasionally we get a zk timeout, causing
> the leader to drop out of the cluster. We have also seen a few zk timeouts
> when index load is very high.
>
> After the failure it can take the now inconsistent node a few hours to
> recover. After numerous failed recovery attempts the failed node seems to
> sync up.
>
> I have attached a log file demonstrating this.
>
> We see lots of timeout requests, seemingly when the failed node tries to
> sync up with the current leader by doing a full sync. This seems wrong:
> there should be no reason for a timeout to happen here.
>
> I am able to manually copy the index using tar + netcat in a few minutes.
> The replication handler takes far longer:
>
> INFO: Total time taken for download : 3549 secs
>
> Why does it take so long to recover?
>
> Are we better off manually replicating the index?
>
> Much appreciated,
> Thanks,
> John
>
