Re: rough maximum cores (shards) per machine?

Shai Erera Tue, 24 Mar 2015 23:51:01 -0700

While it's hard to answer this question because, as others have said, "it
depends", I think it would be good if we could quantify or assess the cost
of running a SolrCore.

For instance, let's say that a server can handle a load of 10M indexed
documents (I omit search load on purpose for now) in a single SolrCore.
Would the same server be able to handle the same number of documents if we
indexed 1000 docs per SolrCore, for a total of 10,000 SolrCores? If the answer
is no, then it means there is some cost that comes w/ each SolrCore, and we
may at least be able to give an upper bound --- on a server with X amount
of storage, Y GB RAM and Z cores you can run up to maxSolrCores(X, Y, Z).
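
To make the idea concrete, here is a minimal sketch of what such a function
could look like, assuming the per-core costs are simply additive and that the
scarcest resource sets the limit. The PerCoreCost fields are hypothetical
placeholders that would have to be measured; they are not known Solr
constants, and the numbers in the usage line are made up.

```python
from dataclasses import dataclass


@dataclass
class PerCoreCost:
    # Hypothetical, to-be-measured costs of ONE SolrCore (must be > 0).
    disk_gb: float       # storage taken by one core
    ram_gb: float        # RAM held by one core, even when idle
    cpu_fraction: float  # share of one CPU core used by one SolrCore


def max_solr_cores(storage_gb: float, ram_gb: float, cpu_cores: int,
                   cost: PerCoreCost) -> int:
    """Rough upper bound: the scarcest resource decides."""
    limits = [
        storage_gb / cost.disk_gb,
        ram_gb / cost.ram_gb,
        cpu_cores / cost.cpu_fraction,
    ]
    return int(min(limits))


# Made-up numbers, only to show the shape of the calculation.
cost = PerCoreCost(disk_gb=0.5, ram_gb=0.25, cpu_fraction=0.125)
print(max_solr_cores(storage_gb=1000, ram_gb=64, cpu_cores=16, cost=cost))
```

With these made-up inputs the CPU term is the smallest, so it sets the bound.
The real work is in measuring the three per-core costs for a given setup.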

Another way to look at it: if I were to create empty SolrCores, would I be
able to create an infinite number of cores if storage was infinite? Or do
even empty cores take their toll on CPU and RAM?

I know from the Lucene side of things that each SolrCore carries a Lucene
index, and every index has its toll -- the lexicon, IW's RAM buffer, Codecs
that store things in memory etc. For instance, one downside of splitting a
10M core into 10,000 cores is that the cost of holding the total
lexicon (dictionary of indexed words) goes up drastically, since now every
word (just the byte[] of the word) is potentially represented in memory
10,000 times.
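
A small simulation can illustrate the effect. This is only a sketch on a
synthetic corpus (random words from a made-up vocabulary, documents spread
round-robin over the cores); it counts just the raw term bytes and does not
model how Lucene actually stores its terms dictionary.

```python
import random


def lexicon_bytes(docs):
    """Bytes needed to hold each distinct term once (term bytes only)."""
    terms = set()
    for doc in docs:
        terms.update(doc)
    return sum(len(t.encode("utf-8")) for t in terms)


def split_round_robin(docs, num_cores):
    """Spread documents over num_cores cores, one lexicon per core."""
    cores = [[] for _ in range(num_cores)]
    for i, doc in enumerate(docs):
        cores[i % num_cores].append(doc)
    return cores


# Synthetic corpus: 10,000 docs of 20 words from a 5,000-word vocabulary.
rng = random.Random(42)
vocab = [f"term{i}" for i in range(5000)]
docs = [[rng.choice(vocab) for _ in range(20)] for _ in range(10_000)]

single = lexicon_bytes(docs)
split = sum(lexicon_bytes(core) for core in split_round_robin(docs, 100))
print(f"one core: {single} bytes, 100 cores: {split} bytes, "
      f"ratio {split / single:.1f}x")
```

The ratio depends heavily on how terms are distributed across documents, so
the output says nothing about any real data set beyond the direction of the
effect.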

What other RAM/CPU/Storage costs does a SolrCore carry with it? There are
the caches of course, which really depend on how many documents are
indexed. Any other non-trivial or constant cost?

So yes, there isn't a single answer to this question. It's just like
asking how many documents a single Lucene index can handle
efficiently. But if we can come up with basic numbers as I outlined above,
it might help people doing rough estimates. That doesn't mean people
shouldn't benchmark, as that upper bound may be waaaay too high for their
data set, query workload and search needs.

Shai

On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman <dami...@gmail.com> wrote:

> From my experience on a high-end server (256GB memory, 40 core CPU) testing
> collection numbers with one shard and two replicas, the maximum that would
> work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
> half of that), depending on your startup-time requirements. (Though I have
> settled on 6,000 collection maximum with some patching. See SOLR-7191). You
> could create multiple clouds after that, and choose the cloud least used to
> create your collection.
>
> Regarding memory usage I'd pencil in 6MB overhead (no docs) java heap per
> collection.
>
> On 25 March 2015 at 13:46, Ian Rose <ianr...@fullstory.com> wrote:
>
> > First off thanks everyone for the very useful replies thus far.
> >
> > Shawn - thanks for the list of items to check.  #1 and #2 should be fine
> > for us and I'll check our ulimit for #3.
> >
> > To add a bit of clarification, we are indeed using SolrCloud.  Our current
> > setup is to create a new collection for each customer.  For now we allow
> > SolrCloud to decide for itself where to locate the initial shard(s) but in
> > time we expect to refine this such that our system will automatically
> > choose the least loaded nodes according to some metric(s).
> >
> > > Having more than one business entity controlling the configuration of a
> > > single (Solr) server is a recipe for disaster. Solr works well if there is
> > > an architect for the system.
> >
> >
> > Jack, can you explain a bit what you mean here?  It looks like Toke caught
> > your meaning but I'm afraid it missed me.  What do you mean by "business
> > entity"?  Is your concern that with automatic creation of collections they
> > will be distributed willy-nilly across the cluster, leading to uneven load
> > across nodes?  If it is relevant, the schema and solrconfig are controlled
> > entirely by me and are the same for all collections.  Thus theoretically we
> > could actually just use one single collection for all of our customers
> > (adding a 'customer:<whatever>' type fq to all queries) but since we never
> > need to query across customers it seemed more performant (as well as safer
> > - less chance of accidentally leaking data across customers) to use
> > separate collections.
> >
> > > Better to give each tenant a separate Solr instance that you spin up and
> > > spin down based on demand.
> >
> >
> > Regarding this, if by tenant you mean "customer", this is not viable for us
> > from a cost perspective.  As I mentioned initially, many of our customers
> > are very small so dedicating an entire machine to each of them would not be
> > economical (or efficient).  Or perhaps I am not understanding what your
> > definition of "tenant" is?
> >
> > Cheers,
> > Ian
> >
> >
> >
> > On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen <t...@statsbiblioteket.dk>
> > wrote:
> >
> > > Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > > > I'm sure that I am quite unqualified to describe his hypothetical setup. I
> > > > mean, he's the one using the term multi-tenancy, so it's for him to be
> > > > clear.
> > >
> > > It was my understanding that Ian used them interchangeably, but of course
> > > Ian is the only one that knows.
> > >
> > > > For me, it's a question of who has control over the config and schema and
> > > > collection creation. Having more than one business entity controlling the
> > > > configuration of a single (Solr) server is a recipe for disaster.
> > >
> > > Thank you. Now your post makes a lot more sense. I will not argue against
> > > that.
> > >
> > > - Toke Eskildsen
> > >
> >
>
>
>
> --
> Damien Kamerman
>
