Hello,

We have recently put Solr 4 into production.

We have a 3-node cluster with a single shard. Each Solr node is also a
ZooKeeper node, with ZooKeeper running as a replicated three-node ensemble
(not Solr's embedded ZooKeeper). We are using the Cloudera ZooKeeper
package.
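
For reference, each Solr node is started against the external ensemble
roughly like this (the hostnames, ports and paths below are placeholders,
not our exact values):

# Rough sketch of the start line on each Solr 4 node, pointing it at the
# external three-node ZooKeeper ensemble rather than the embedded one.
java -DzkHost=mmsolr01:2181,mmsolr02:2181,mmsolr03:2181 \
     -Dsolr.solr.home=/var/lib/solr \
     -jar start.jar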

There are no communication problems between the nodes. They sit in two
different racks that are directly connected over a 2Gb uplink, and each
node has a 1Gb uplink.

The idea was that mmsolr01 would ideally be the leader: the application
sends all index requests directly to the leader node, and a load balancer
splits read requests across the remaining two nodes.
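
Concretely, the traffic split looks something like this (the load-balancer
name and the document fields are placeholders):

# Index requests go straight to the intended leader (mmsolr01).
curl -s 'http://mmsolr01:8983/solr/collection1/update' \
     -H 'Content-Type: application/json' \
     -d '[{"id":"doc-1"}]'

# Read requests go through the load balancer ("solr-lb" is a placeholder),
# which spreads them across the other two nodes.
curl -s 'http://solr-lb:8983/solr/collection1/select?q=*:*&rows=0&wt=json'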

We autocommit every 300s or 10k documents, with a soft commit every 5s. The
index is roughly 200 million documents.
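
On the wire that is roughly equivalent to the following (assuming the hard
autocommit is configured with openSearcher=false, which I believe is the
case for us):

# Hard commit: flush recent documents to disk without opening a new
# searcher (what the 300s / 10k-document autocommit does).
curl -s 'http://localhost:8983/solr/collection1/update?commit=true&openSearcher=false'

# Soft commit: make recent documents visible to searchers without a full
# flush (what the 5s soft commit does).
curl -s 'http://localhost:8983/solr/collection1/update?softCommit=true'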

I have configured a cron job to run every hour (on every node):

0 * * * * /usr/bin/curl -s 'http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3' > /dev/null
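
For testing I can also trigger the same backup by hand and poll its status
via the replication handler's details command, something like:

# Kick off a backup manually (same call the cron job makes).
curl -s 'http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3'

# Ask the replication handler for its status, which includes details of
# the most recent backup/snapshot.
curl -s 'http://localhost:8983/solr/collection1/replication?command=details&wt=json'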

Triggering a snapshot (backup) seems to be the easiest way to reproduce the
problem, but we can also reproduce it under very heavy indexing load.

While the snapshot is running, we occasionally get a ZooKeeper timeout,
which causes the leader to drop out of the cluster. We have also seen a few
ZooKeeper timeouts when indexing load is very high.
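
My assumption is that the relevant knob is zkClientTimeout, which with the
stock Solr 4 solr.xml defaults to 15 seconds and is read from a system
property, so raising it would look roughly like this (illustrative values,
not what we currently run):

# Same sort of start line as above, with the ZooKeeper client/session
# timeout raised from the 15 s default to 30 s (illustrative value only).
java -DzkHost=mmsolr01:2181,mmsolr02:2181,mmsolr03:2181 \
     -DzkClientTimeout=30000 \
     -jar start.jar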

After the failure, it can take the now-inconsistent node a few hours to
recover. After numerous failed recovery attempts, the failed node
eventually seems to sync up.

I have attached a log file demonstrating this.

We see lots of request timeouts, seemingly when the failed node tries to
sync up with the current leader by doing a full sync. This seems wrong;
there should be no reason for a timeout to happen here, should there?

I am able to manually copy the index between nodes using tar + netcat in a
few minutes. The replication handler, by contrast, reports:

INFO: Total time taken for download : 3549 secs
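
For comparison, the manual copy is roughly the following (paths, port and
the target hostname are placeholders for our actual layout):

# On the node that needs the index: listen and unpack into the data dir
# (some netcat variants want 'nc -l -p 1234' instead).
nc -l 1234 | tar xf - -C /var/lib/solr/collection1/data

# On the node that has a good copy: stream the index directory across
# ("mmsolr02" is a placeholder for the receiving node).
tar cf - -C /var/lib/solr/collection1/data index | nc mmsolr02 1234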

Why does it take so long to recover?

Are we better off manually replicating the index?

Much appreciated,
Thanks,
John

Attachment: sample.txt.gz
