On 11/13/2017 12:33 PM, Shamik Bandopadhyay wrote:
> I'm looking for some input on design considerations for defining
> collections in a SolrCloud cluster. Right now, our cluster consists of two
> collections in a 2 shard / 2 replica mode. Each collection has a dedicated
> set of source and don't overlap, which made it an easy decision.
> Recently, we've a requirement to index a bunch of new sources that are
> region based. The search result corresponding to those region needs to come
> from their specific source as well sources from one of our existing
> collection. Here's an example of our existing collection and their
> corresponding source(s).

You haven't defined in *ANY* way exactly what a "source" is or how that data actually gets into Solr. Without that information, it'll be difficult to even understand your requirements.

If I make one assumption, that the config and schema are going to be identical for all of the data sources, then I can give you this information:

If you set up each source as a collection in your SolrCloud, you can create collection aliases that let you query multiple collections with one query. Whether or not this will work correctly will depend on a few factors, but most of all on whether all the data is using the same (or extremely similar) Solr config/schema.

> The other consideration is the hardware design. Right now, both shards and
> their replicas run on their dedicated instance. With two collections, we
> sometimes run into OOM scenarios, so I'm a little bit worried about adding
> more collections. Does the best practice (I know it's subjective) in
> scenarios like this call for a dedicated Solr cluster per collection? From
> index size perspective, Source_C,Source_D and Source_E combines close to10
> million documents with 60gb volume size. Each geo based source is small,
> won't exceed more than 500k documents.

10 million documents producing 60GB of index data means that the documents are relatively large, but aren't super huge -- or that the data in them is duplicated several times. For contrast, I have an index where each shard has about 30 million docs, and each of those shards is 36GB in size. The entire index has six of these large shards and one tiny hot shard.

I always get a little anxious when somebody wants best practice information about Solr configurations and hardware. Any recommendation that we make will be COMPLETELY wrong for some use cases, indexes, and/or query patterns. Solr configurations and hardware must be tailored specifically for the use case, index data, and query patterns that actually exist.
Typically, this means that you have to actually set up a full system and try it to make any determinations about how much hardware you need.

https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Regarding your hardware sizing, the only general advice I can give you is this: Good performance usually ends up requiring significantly more RAM than users plan on.

Thanks,
Shawn
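
P.S. To make the collection alias idea a bit more concrete, here is a rough sketch of the requests involved. It only builds the URLs; it does not contact Solr. The CREATEALIAS action of the Collections API is real, but the base URL, the collection names (existing_sources, region_emea) and the alias name (emea_search) are made up for illustration, and it assumes that all of the collections share the same (or a very similar) config/schema.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust to wherever your SolrCloud nodes live.
SOLR_BASE = "http://localhost:8983/solr"


def create_alias_url(base_url, alias, collections):
    """Build a Collections API CREATEALIAS request URL.

    The alias will point at every collection in `collections`.
    """
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        # Solr expects a comma-separated list of collection names.
        "collections": ",".join(collections),
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def select_url(base_url, target, query):
    """Build a /select query URL against a collection or an alias."""
    return f"{base_url}/{target}/select?{urlencode({'q': query})}"


if __name__ == "__main__":
    # Hypothetical names: one existing collection plus one regional one.
    print(create_alias_url(SOLR_BASE, "emea_search",
                           ["existing_sources", "region_emea"]))
    # Queries sent to the alias are run across both collections.
    print(select_url(SOLR_BASE, "emea_search", "*:*"))
```

With one alias per region, each regional search would query its alias instead of an individual collection. Whether the merged results are sensible still depends on the schemas matching.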