On 5/4/2013 2:06 AM, Kumar Limbu wrote:
> We have Solr setup on 3 machines with only a single shard. We are using Solr
> 4.0 and currently have around 7 Million documents in our index. The size of
> our index is around 25 GB. We have a zookeeper ensemble of 3 zookeeper
> instances.
>
> Let's call the servers in our setup server (A), (B) and (C). All updates to
> Solr goes via server (C). Searches are performed on server (A) and (B). The
> updates are normally propagated incrementally from server (C) to the other 2
> servers. Intermittently we have noted that the servers (A) and (B) makes a
> full copy of the index from server (C). This is not ideal because when this
> happens performance suffers. This occurs quite randomly and can occur on any
> of the other 2 nodes i.e. (A) and (B).
>
> On the server (C), which is the leader, we see errors like the following. We
> suspect this might be the reason why a full index copy occurs in the other
> nodes but we haven't been able to find out why this error is occurring.
> There is no connectivity issue with the servers.
Advance warning: this is a long reply.

The first thing that jumped out at me was the Solr version. Version 4.0 was brand new in October of last year. It's a senior citizen now. It has a lot of bugs, particularly in SolrCloud stability. I would recommend upgrading to at least 4.2.1. Version 4.3.0 (the fourth release since 4.0) is quite literally about to be unveiled. It is already on a lot of download mirrors; the announcement is due any time now.

Now for things to consider that don't involve upgrading, but that might still be issues after upgrading:

You might be able to make your system more stable by increasing your zkClientTimeout. A typical example value for this setting is 15 seconds.

Next, why you might be exceeding the timeout: slow operations, especially on commits, can be responsible for exceeding timeouts. One of the things you can do to decrease commit time is to lower the autowarmCount on your Solr caches. You can also decrease the frequency of your commits.

A 25GB index is relatively large, and requires a lot of memory for proper operation. The reason it requires a lot of memory is that Solr is very reliant on the operating system disk cache, which uses free memory:

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

With a 25GB index, you want to have between 15 and 25GB of memory over and above the memory that your programs use. You would probably want to give the Java heap for Solr between 4 and 8GB. For a dedicated Solr server with your index, a really good amount of total system memory would be 32GB, with 24GB being a reasonable starting point.

It should go without saying that you need a 64-bit server, a 64-bit operating system, and 64-bit Java for all this to work correctly. 32-bit software is not good at dealing with large amounts of memory, and 32-bit Java cannot have a heap size larger than 2GB.

If you upgrade to 4.2.1 or later and reindex, your index size will drop due to compression of certain pieces.
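To make the config suggestions from earlier more concrete, here is a rough sketch of where zkClientTimeout, autowarmCount, and commit frequency live in the config files. The specific numbers (a 30 second timeout, autowarmCount of 16, a 5 minute autoCommit) are illustrative assumptions to experiment with, not tuned recommendations for your index:

```xml
<!-- solr.xml (Solr 4.x legacy format): raise the ZooKeeper client
     timeout. The value is in milliseconds; 30000 is just an example. -->
<cores adminPath="/admin/cores" zkClientTimeout="30000">
  <!-- your existing core definitions, unchanged -->
</cores>

<!-- solrconfig.xml: lower autowarmCount so a new searcher opens faster
     after a commit. The size/autowarmCount values here are placeholders. -->
<filterCache class="solr.FastLRUCache"
             size="512" initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.LRUCache"
                  size="512" initialSize="512" autowarmCount="16"/>

<!-- solrconfig.xml: commit less often. A hard commit every five minutes
     with openSearcher=false flushes to disk without opening a new
     searcher, so it doesn't trigger cache warming. -->
<autoCommit>
  <maxTime>300000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
```

If you go the openSearcher=false route, remember that new documents only become visible when something does open a searcher, such as a soft commit or an explicit commit from your indexing code.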
Those compressed pieces don't normally affect minimum memory requirements very much, so your free memory requirement will still probably be at least 15GB.

Unless you are using a commercial JVM with low-pause characteristics (like Zing), a heap of 4GB or larger can give you problems with stop-the-world GC pauses. A large heap is unfortunately required with a large index. The default collector that Java gives you is a *terrible* choice for large heaps in general and Solr in particular. Even changing to the CMS collector may not be enough; more tuning is required.

Thanks,
Shawn
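P.S. To illustrate the kind of CMS tuning I mean, here is a sketch of JVM startup options for a 6GB heap. These particular flag values are assumptions to use as a starting point for your own testing, not numbers I have validated against your index:

```shell
# Illustrative JVM options for a Solr start script (Java 6/7 era).
# -Xms/-Xmx equal: avoid heap resizing pauses.
# CMS + ParNew: low-pause collectors instead of the default.
# InitiatingOccupancyFraction: start CMS cycles early (at 70% old gen)
# so the collector doesn't fall behind and trigger a full stop-the-world GC.
JAVA_OPTS="-Xms6g -Xmx6g \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```

Watch your GC logs after changing these; if pauses are still long, the heap size and occupancy fraction are the first knobs to revisit.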