On 4/2/2015 4:46 PM, Ryan Steele wrote:
> Thank you Shawn and Toke for the information and links! No, I was not
> the one on #solr IRC channel. :/ Here are the details I have right
> now: I'm building/running the operations side of this new SolrCloud
> cluster. It will be in Amazon, the initial cluster I'm planning to
> start with is 5 r3.xlarge instances each using a general purpose SSD
> EBS volume for the SolrCloud related data (this will be separate from
> the EBS volume used by the OS). Each instance has 30.5 GiB RAM--152.5
> GiB cluster wide--and each instance has 4 vCPU's. I'm using Oracle
> Java 1.8.0_31 and the G1 GC.
Java 8u40 is supposed to have some significant improvements to G1 garbage collection, so I would recommend an upgrade from 8u31. I heard this directly from Oracle engineers on a mailing list for GC issues.

> The data will be indexed on a separate machine and added to the
> SolrCloud cluster while searching is taking place. Unfortunately I
> don't have numbers at this time on how much data will be indexed. I do
> know that we will have over 2000 collections--some will be small (a
> few hundred documents and only a few megabytes at most), and a few
> will be very large (somewhere in the gigabytes). Our old Solr
> Master/Slave systems isn't broken up this way, so we aren't certain
> about how exactly things will map out in SolrCloud.

If it is a viable option to combine collections that use the same or similar schemas and do filtering on the query side to reduce the total number of collections to only a few hundred, your SolrCloud experience will probably be better. (There's a rough sketch of what I mean by query-side filtering after my signature.) See this issue:

https://issues.apache.org/jira/browse/SOLR-7191

General SolrCloud stability is not very good with thousands of collections, but I would imagine that SSD storage will improve that, especially if the zookeeper database is also on SSD.

In a perfect world, for the best performance, you would have enough memory across the cluster so that you can cache all of the index data present on the cluster, including all replicas ... but for terabyte scale indexes, that's either a huge amount of RAM on a modest number of servers or a huge number of servers, each with a big chunk of RAM. Either way it's very expensive, especially on Amazon.

Usually you can achieve very good performance without a perfect one-to-one relationship between index size and RAM. The fact that you will have a lot of smaller indexes will hopefully mean only some of them are needed at any given time. If that's the case, your overall memory requirements will be lower than if you had a single 1TB index, and I think the SSD storage will help the performance of those smaller indexes a lot more than it would for very large indexes.

Thanks,
Shawn
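
P.S. To illustrate the query-side filtering idea: this is only a rough SolrJ sketch, assuming a combined collection (here called docs_combined) with an extra field (here called source) that records which of the original collections each document came from. The zkHost string, collection name, and field names are placeholders, not anything from your actual setup.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CombinedCollectionQuery {
        public static void main(String[] args) throws Exception {
            // Cluster-aware client pointed at ZooKeeper (placeholder zkHost).
            CloudSolrClient client =
                new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
            client.setDefaultCollection("docs_combined");

            // The user's search stays in q; the restriction to one of the
            // original collections goes in fq, so it lands in the filterCache
            // and gets reused across queries.
            SolrQuery query = new SolrQuery("title:widget");
            query.addFilterQuery("source:catalog_a");

            QueryResponse rsp = client.query(query);
            System.out.println("numFound: " + rsp.getResults().getNumFound());
            client.close();
        }
    }

The same thing works with a plain HTTP request by adding fq=source:catalog_a to the select URL.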