Actually, I was mistaken. I thought we were running 4.1.0, but we were
in fact running 4.0.0.

I will upgrade to 4.1.0 and see if this is still happening.

Thanks,
John

On Wed, Jan 23, 2013 at 9:39 PM, John Skopis (lists) <jli...@skopis.com> wrote:

> Sorry for leaving that bit out. This is Solr 4.1.0.
>
> Thanks again,
> John
>
> On Wed, Jan 23, 2013 at 5:39 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
>
>> Hi,
>>
>> Is this Solr 4.0 or 4.1? If the former, try the latter first?
>>
>> Otis
>> Solr & ElasticSearch Support
>> http://sematext.com/
>> On Jan 23, 2013 2:51 PM, "John Skopis (lists)" <jli...@skopis.com> wrote:
>>
>> > Hello,
>> >
>> > We have recently put solr4 into production.
>> >
>> > We have a 3-node cluster with a single shard. Each Solr node is also a
>> > ZooKeeper node, but ZooKeeper is running in cluster mode. We are using
>> > the Cloudera ZooKeeper package.
>> >
>> > There are no communication problems between the nodes. They are in two
>> > different racks directly connected over a 2Gb uplink, and each node
>> > has a 1Gb uplink.
>> >
>> > Ideally, mmsolr01 would be the leader: the application sends all index
>> > requests directly to the leader node, and a load balancer splits read
>> > requests over the remaining two nodes.
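>> >
>> > For reference, a rough way to check which node is currently the leader
>> > is the cloud admin's ZooKeeper view (just a sketch; mmsolr01 is our node
>> > name, and the exact query parameters may differ by Solr 4 build):
>> >
>> > curl -s 'http://mmsolr01:8983/solr/zookeeper?detail=true&path=/clusterstate.json'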
>> >
>> > We autocommit every 300s or 10k documents, with a soft commit every 5s.
>> > The index is roughly 200 million documents.
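>> >
>> > For what it's worth, that cadence corresponds roughly to issuing the
>> > following against the stock update handler (just a sketch of the timing,
>> > not our actual solrconfig.xml):
>> >
>> > # hard commit, as autoCommit fires every 300s / 10k docs
>> > # (openSearcher=false shown here only to keep the hard commit cheap)
>> > curl -s 'http://localhost:8983/solr/collection1/update?commit=true&openSearcher=false'
>> > # soft commit, as autoSoftCommit fires every 5s to make new docs visible
>> > curl -s 'http://localhost:8983/solr/collection1/update?softCommit=true'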
>> >
>> > I have configured a cron job to run every hour (on every node):
>> >
>> > 0 * * * * /usr/bin/curl -s 'http://localhost:8983/solr/collection1/replication?command=backup&numberToKeep=3' > /dev/null
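>> >
>> > To check whether a backup actually completed (and how long it took), we
>> > can poll the replication handler's details command, e.g.:
>> >
>> > curl -s 'http://localhost:8983/solr/collection1/replication?command=details&wt=json'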
>> >
>> > Using a snapshot seems to be the easiest way to reproduce, but it's also
>> > possible to reproduce under very heavy indexing load.
>> >
>> > When the snapshot is running, we occasionally get a ZooKeeper timeout,
>> > causing the leader to drop out of the cluster. We have also seen a few
>> > ZooKeeper timeouts when index load is very high.
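>> >
>> > When this happens we can at least confirm the ensemble itself is healthy
>> > with ZooKeeper's four-letter-word commands (a sketch; mmsolr02 and
>> > mmsolr03 are assumed node names, 2181 is the default client port):
>> >
>> > for h in mmsolr01 mmsolr02 mmsolr03; do
>> >   echo -n "$h ruok: "; echo ruok | nc $h 2181; echo
>> >   echo stat | nc $h 2181 | grep -E 'Mode|Latency'
>> > done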
>> >
>> > After the failure, it can take the now-inconsistent node a few hours to
>> > recover. After numerous failed recovery attempts, the failed node
>> > eventually seems to sync up.
>> >
>> > I have attached a log file demonstrating this.
>> >
>> > We see lots of request timeouts, seemingly when the failed node tries to
>> > sync up with the current leader by doing a full sync. This seems wrong;
>> > there should be no reason for a timeout to happen here.
>> >
>> > I am able to manually copy the index using tar + netcat in a few
>> > minutes. The replication handler, by contrast, reports:
>> >
>> > INFO: Total time taken for download : 3549 secs
>> >
>> > Why does it take so long to recover?
>> >
>> > Are we better off manually replicating the index?
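>> >
>> > The manual copy is roughly the following (a sketch; the data path, port,
>> > and hostnames are examples, nc flags vary by netcat flavor, and the
>> > receiving node should not be serving or indexing while the index
>> > directory is copied):
>> >
>> > # on the node that needs the index (receiver, mmsolr03 here):
>> > nc -l 1234 | tar xf - -C /var/lib/solr/collection1/data
>> > # on a healthy node (sender):
>> > tar cf - -C /var/lib/solr/collection1/data index | nc mmsolr03 1234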
>> >
>> > Much appreciated,
>> > Thanks,
>> > John
>> >
>>
>
>
