Just to give a specific answer to the original question, I would say that
dozens of cores (collections) is certainly fine (assuming the total data
load and query rate is reasonable), maybe 50 or even 100. Low hundreds of
cores/collections MAY work, but isn't advisable. Thousands, if it works at
all, is probably just asking for trouble and likely to be far more hassle
than it could possibly be worth.
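As a rough back-of-the-envelope illustration only -- using the ~6MB of
baseline Java heap per empty collection that Damien mentions below, a
figure that will vary with your configs and caches:

    100 collections   x ~6MB = ~0.6GB of heap before indexing a single doc
    1,000 collections x ~6MB = ~6GB
    3,000 collections x ~6MB = ~18GB

And that is just the constant per-collection overhead, before any index
data, caches, or query load.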
Whether the number for you ends up being 37, 50, 75, 100, 237, or 1273,
you will have to do a proof of concept implementation to validate it. I'm
not sure where we are at these days for lazy-loading of cores (there is a
rough config sketch at the end of this message). That may work for you
with hundreds (thousands?!) of cores/collections for tenants who are
mostly idle or dormant, but if the server is running long enough, it may
build up a lot of memory usage for collections that were active but have
gone idle after days or weeks.

--
Jack Krupansky

On Wed, Mar 25, 2015 at 2:49 AM, Shai Erera <ser...@gmail.com> wrote:

> While it's hard to answer this question because, as others have said,
> "it depends", I think it would be good if we can quantify or assess the
> cost of running a SolrCore.
>
> For instance, let's say that a server can handle a load of 10M indexed
> documents (I omit search load on purpose for now) in a single SolrCore.
> Would the same server be able to handle the same number of documents if
> we indexed 1,000 docs per SolrCore, for a total of 10,000 SolrCores? If
> the answer is no, then it means there is some cost that comes w/ each
> SolrCore, and we may at least be able to give an upper bound -- on a
> server with X amount of storage, Y GB RAM and Z cores you can run up to
> maxSolrCores(X, Y, Z).
>
> Another way to look at it: if I were to create empty SolrCores, would I
> be able to create an infinite number of cores if storage was infinite?
> Or do even empty cores take their toll on CPU and RAM?
>
> I know from the Lucene side of things that each SolrCore (which carries
> a Lucene index) adds a toll of its own -- the lexicon, IW's RAM buffer,
> codecs that store things in memory, etc. For instance, one downside of
> splitting a 10M-doc core into 10,000 cores is that the cost of holding
> the total lexicon (dictionary of indexed words) goes up drastically,
> since now every word (just the byte[] of the word) is potentially
> represented in memory 10,000 times.
>
> What other RAM/CPU/storage costs does a SolrCore carry with it? There
> are the caches of course, which really depend on how many documents are
> indexed. Any other non-trivial or constant cost?
>
> So yes, there isn't a single answer to this question. It's just like
> asking how many documents a single Lucene index can handle efficiently.
> But if we can come up with basic numbers as I outlined above, it might
> help people doing rough estimates. That doesn't mean people shouldn't
> benchmark, as that upper bound may be waaaay too high for their data
> set, query workload and search needs.
>
> Shai
>
> On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman <dami...@gmail.com>
> wrote:
>
> > From my experience on a high-end server (256GB memory, 40-core CPU)
> > testing collection numbers with one shard and two replicas, the
> > maximum that would work is 3,000 cores (1,500 collections). I'd
> > recommend much less (perhaps half of that), depending on your
> > startup-time requirements. (Though I have settled on a
> > 6,000-collection maximum with some patching. See SOLR-7191.) You
> > could create multiple clouds after that, and choose the least-used
> > cloud to create your collection on.
> >
> > Regarding memory usage, I'd pencil in 6MB of overhead (no docs) in
> > Java heap per collection.
> >
> > On 25 March 2015 at 13:46, Ian Rose <ianr...@fullstory.com> wrote:
> >
> > > First off, thanks everyone for the very useful replies thus far.
> > >
> > > Shawn - thanks for the list of items to check. #1 and #2 should be
> > > fine for us and I'll check our ulimit for #3.
> > >
> > > To add a bit of clarification, we are indeed using SolrCloud. Our
> > > current setup is to create a new collection for each customer. For
> > > now we allow SolrCloud to decide for itself where to locate the
> > > initial shard(s), but in time we expect to refine this such that
> > > our system will automatically choose the least loaded nodes
> > > according to some metric(s).
> > >
> > > > Having more than one business entity controlling the
> > > > configuration of a single (Solr) server is a recipe for disaster.
> > > > Solr works well if there is an architect for the system.
> > >
> > > Jack, can you explain a bit what you mean here? It looks like Toke
> > > caught your meaning but I'm afraid it missed me. What do you mean
> > > by "business entity"? Is your concern that with automatic creation
> > > of collections they will be distributed willy-nilly across the
> > > cluster, leading to uneven load across nodes? If it is relevant,
> > > the schema and solrconfig are controlled entirely by me and are the
> > > same for all collections. Thus, theoretically, we could actually
> > > just use one single collection for all of our customers (adding a
> > > 'customer:<whatever>' type fq to all queries), but since we never
> > > need to query across customers it seemed more performant (as well
> > > as safer - less chance of accidentally leaking data across
> > > customers) to use separate collections.
> > >
> > > > Better to give each tenant a separate Solr instance that you spin
> > > > up and spin down based on demand.
> > >
> > > Regarding this, if by tenant you mean "customer", this is not
> > > viable for us from a cost perspective. As I mentioned initially,
> > > many of our customers are very small, so dedicating an entire
> > > machine to each of them would not be economical (or efficient). Or
> > > perhaps I am not understanding what your definition of "tenant" is?
> > >
> > > Cheers,
> > > Ian
> > >
> > > On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen
> > > <t...@statsbiblioteket.dk> wrote:
> > >
> > > > Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > > > > I'm sure that I am quite unqualified to describe his
> > > > > hypothetical setup. I mean, he's the one using the term
> > > > > multi-tenancy, so it's for him to be clear.
> > > >
> > > > It was my understanding that Ian used them interchangeably, but
> > > > of course Ian is the only one who knows.
> > > >
> > > > > For me, it's a question of who has control over the config and
> > > > > schema and collection creation. Having more than one business
> > > > > entity controlling the configuration of a single (Solr) server
> > > > > is a recipe for disaster.
> > > >
> > > > Thank you. Now your post makes a lot more sense. I will not argue
> > > > against that.
> > > >
> > > > - Toke Eskildsen
> > >
> >
> > --
> > Damien Kamerman
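P.S. On the lazy-loading point above: as far as I know that is the
standalone (non-SolrCloud) transient-core feature, so it may not help
Ian's SolrCloud setup at all. Treat the following as a minimal sketch to
verify against your Solr version (core name invented for illustration),
not a recipe:

    # core.properties for one tenant's core
    name=tenant_0042
    transient=true
    loadOnStartup=false

    <!-- solr.xml: cap how many transient cores stay loaded at once -->
    <solr>
      <int name="transientCacheSize">100</int>
      ...
    </solr>

With something like that, cores beyond the cache size get unloaded
LRU-style, which is what would address the memory build-up from tenants
that go idle.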
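P.P.S. On Ian's aside about one shared collection with a per-customer fq:
purely as an illustration (collection and field names invented), that
variant is just something like:

    http://localhost:8983/solr/shared_collection/select?q=*:*&fq=customer_id:12345

The filter query gets cached in the filterCache, so repeated queries for
the same tenant stay cheap, but every query path in the application has
to remember to add it -- which is exactly the data-leak risk Ian is
worried about.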