On 4/2/2015 4:46 PM, Ryan Steele wrote:
> Thank you Shawn and Toke for the information and links! No, I was not
> the one on #solr IRC channel. :/ Here are the details I have right
> now: I'm building/running the operations side of this new SolrCloud
> cluster. It will be in Amazon, the initial cluster I'm planning to
> start with is 5 r3.xlarge instances each using a general purpose SSD
> EBS volume for the SolrCloud related data (this will be separate from
> the EBS volume used by the OS). Each instance has 30.5 GiB RAM--152.5
> GiB cluster wide--and each instance has 4 vCPU's. I'm using Oracle
> Java 1.8.0_31 and the G1 GC. 

Java 8u40 is supposed to have some significant improvements to G1
garbage collection, so I would recommend an upgrade from 8u31.  I heard
this directly from Oracle engineers on a mailing list for GC issues.
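If you're starting Solr with the bin/solr script, the GC options live
in solr.in.sh.  As a purely illustrative sketch (the region size and
pause target below are guesses, not values tuned for your workload),
the G1 settings might look something like this:

  GC_TUNE="-XX:+UseG1GC \
    -XX:+ParallelRefProcEnabled \
    -XX:G1HeapRegionSize=8m \
    -XX:MaxGCPauseMillis=250"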

> The data will be indexed on a separate machine and added to the
> SolrCloud cluster while searching is taking place. Unfortunately I
> don't have numbers at this time on how much data will be indexed. I do
> know that we will have over 2000 collections--some will be small (a
> few hundred documents and only a few megabytes at most), and a few
> will be very large (somewhere in the gigabytes). Our old Solr
> Master/Slave system isn't broken up this way, so we aren't certain
> about how exactly things will map out in SolrCloud. 

If it is a viable option to combine collections that use the same or
similar schemas and filter on the query side, reducing the total
number of collections to only a few hundred, your SolrCloud experience
will probably be better.  See this issue:

https://issues.apache.org/jira/browse/SOLR-7191
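For example, if the combined collections carried a discriminator field
(a hypothetical "tenant_id" field below), each logical collection could
be selected with a filter query.  A minimal SolrJ sketch, assuming a
merged collection named "combined" and the SolrJ API as of Solr 5.x:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class TenantQuery {
    public static void main(String[] args) throws Exception {
      // Connect through ZooKeeper so requests are routed cloud-aware.
      CloudSolrClient client =
          new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
      client.setDefaultCollection("combined");

      SolrQuery q = new SolrQuery("title:widget");
      // The filter query stands in for what used to be a separate
      // collection; Solr caches filter results independently of the
      // main query, so repeated tenant filters are cheap.
      q.addFilterQuery("tenant_id:ryan");

      QueryResponse rsp = client.query(q);
      System.out.println("numFound: " + rsp.getResults().getNumFound());
      client.close();
    }
  }

Because each tenant's filter only has to be computed once and then sits
in the filterCache, the query-side filtering usually costs very little.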

General SolrCloud stability is not very good with thousands of
collections, but I would imagine that SSD storage will improve that,
especially if the ZooKeeper database is also on SSD.

In a perfect world, for the best performance, you would have enough
memory across the cluster so that you can cache all of the index data
present on the cluster, including all replicas ... but for terabyte
scale indexes, that's either a huge amount of RAM on a modest number of
servers or a huge number of servers, each with a big chunk of RAM. 
Either way it's very expensive, especially on Amazon.  Usually you can
achieve very good performance without a perfect one-to-one relationship
between index size and RAM.
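
To put rough numbers on that (the 8 GiB heap is just a guess for
illustration, not a recommendation):

  30.5 GiB RAM - 8 GiB heap - ~1 GiB OS/other = ~21.5 GiB disk cache per node
  5 nodes x ~21.5 GiB                         = ~107 GiB disk cache cluster-wide

That ~107 GiB is the pool the OS can use to cache index data, and it's
the number to compare against your total on-disk index size.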

The fact that you will have a lot of smaller indexes will hopefully mean
only some of them are needed at any given time.  If that's the case,
your overall memory requirements will be lower than if you had a single
1TB index, and I think the SSD storage will help the performance of
those smaller indexes a lot more than it would for very large indexes.

Thanks,
Shawn
