Erick, I like how you broke memory down into its different parts here. I
feel that if we could combine this knowledge from your various posts, the
sizing blog above, the Solr wiki pages, and Uwe's article on MMap/heap,
and consolidate it all in a single place, it would help a lot of new folks
and anyone struggling with memory/heap/sizing questions.

Thanks,
Susheel

On Wed, Dec 9, 2015 at 12:40 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> I object to the question. And the advice. And... ;).
>
> Practically, IMO guidance that "the entire index should
> fit into memory" is misleading, especially for newbies.
> Let's break it down:
>
> 1>  "the entire index". What's this? The size on disk?
> 90% of that size on disk may be stored data which
> uses very little memory, which is limited by the
> documentCache in Solr. OTOH, only 10% of the on-disk
> size might be stored data.
>
> 2> "fit into memory". What memory? Certainly not
> the JVM as much of the Lucene-level data is in
> MMapDirectory which uses the OS memory. So
> this _probably_ means JVM + OS memory, and OS
> memory is shared amongst other processes as well.
>
> 3> Solr and Lucene build in-memory structures that
> aren't reflected in the index size on disk. I've seen
> filterCaches, for instance, that were (mis)configured
> such that they could grow to 100s of GB (rough numbers
> below). None of that shows up in the "index size".
>
> 4> Try faceting on a text field with lots of unique
> values. Bad Practice, but you'll see just how quickly
> the _query_ can change the memory requirements.
>
> 5> Sure, with modern hardware we can create huge JVM
> heaps... that hit GC pauses that'll drive performance
> down, sometimes radically.
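>
> To put rough numbers on <3>, here's a back-of-the-envelope sketch in
> Python; the maxDoc and filterCache sizes are made-up, not a recommendation:
>
>   # Worst case, each filterCache entry is a bitset of maxDoc bits.
>   max_doc = 350_000_000              # docs in the core (assumed)
>   entry_bytes = max_doc / 8          # ~44 MB per cached filter
>   for cache_size in (512, 10_000):   # filterCache "size" settings (assumed)
>       gib = entry_bytes * cache_size / 2**30
>       print(f"size={cache_size}: up to ~{gib:,.0f} GiB of heap")
>   # size=512: ~21 GiB; size=10000: ~407 GiB -- all heap, none of it
>   # visible in the on-disk index size.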
>
> I've seen 350M docs, 200-300 fields (aggregate) fit into 12G
> of JVM. I've seen 25M docs (really big ones) strain 48G
> JVM heaps.
>
> Jack's approach is what I use; pick a number and test with it.
> Here's an approach:
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Best,
> Erick
>
> On Wed, Dec 9, 2015 at 8:54 AM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
> > Thanks, Jack, for the quick reply.  By replica/shard I mean that on a
> > given machine there may be two or more replicas, and all of them together
> > may not fit into memory.
> >
> > On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky <jack.krupan...@gmail.com>
> > wrote:
> >
> >> Yes, there are nuances to any general rule. It's just a starting point, and
> >> your own testing will confirm specific details for your specific app and
> >> data. For example, maybe you don't query all fields commonly, so each
> >> field-specific index may not require memory or not require it so commonly.
> >> And, yes, each app has its own latency requirements. The purpose of a
> >> general rule is to generally avoid unhappiness, but if you have an appetite
> >> and tolerance for unhappiness, then go for it.
> >>
> >> Replica vs. shard? They're basically the same - a replica is a copy of a
> >> shard.
> >>
> >> -- Jack Krupansky
> >>
> >> On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar <susheel2...@gmail.com>
> >> wrote:
> >>
> >> > Hi Jack,
> >> >
> >> > Just to add: the OS disk cache will still keep queries performant even
> >> > though the entire index can't be loaded into memory. How much extra
> >> > latency there is, compared to the index being fully cached, will vary
> >> > depending on index size etc. I am trying to clarify this here because a
> >> > lot of folks take "fit the index into memory" as a hard guideline and
> >> > try to come up with hardware (hundreds of machines) just for the sake of
> >> > fitting the index into memory, even though there may not be much
> >> > load/QPS on the cluster. It varies and needs to be tested case by case,
> >> > but, for example, a machine with 64GB of RAM should still provide good
> >> > (if not the best) performance for a 100GB index on that machine. Do you
> >> > agree / any thoughts?
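> >> >
> >> > As a rough sketch of why that can work (all numbers here are assumed,
> >> > not measured):
> >> >
> >> >   ram_gb, heap_gb, os_gb = 64, 16, 2   # total RAM, JVM heap, OS overhead (assumed)
> >> >   index_gb = 100
> >> >   page_cache_gb = ram_gb - heap_gb - os_gb
> >> >   print(f"~{page_cache_gb} GB of page cache for a {index_gb} GB index "
> >> >         f"(~{page_cache_gb / index_gb:.0%} cacheable)")
> >> >   # ~46 GB for a 100 GB index => ~46% cacheable; if the frequently-hit
> >> >   # parts of the index fit in that, latency can still be acceptable.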
> >> >
> >> > I believe the same is the case with replicas: on a single machine you
> >> > may have several replicas which themselves may not fit into memory,
> >> > along with the shard index.
> >> >
> >> > Thanks,
> >> > Susheel
> >> >
> >> > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <jack.krupan...@gmail.com>
> >> > wrote:
> >> >
> >> > > Generally, you will be resource limited (memory, CPU) rather than by
> >> > > some arbitrary numeric limit (like 2 billion).
> >> > >
> >> > > My personal recommendation is a practical limit of 100 million
> >> > > documents per machine/node. Depending on your data model and actual
> >> > > data, that number could be higher or lower. A proof-of-concept test
> >> > > will let you determine the actual number for your particular use case,
> >> > > but a presumed limit of 100 million is not a bad start.
> >> > >
> >> > > You should have enough memory to hold the entire index in system
> >> > > memory. If not, your query latency will suffer due to the I/O required
> >> > > to constantly re-read portions of the index into memory.
> >> > >
> >> > > The practical limit on documents is not per core or per number of
> >> > > cores, but across all cores on the node, since it is mostly a matter
> >> > > of memory and the CPU resources available for accessing that memory.
> >> > >
> >> > > -- Jack Krupansky
> >> > >
> >> > > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <t...@statsbiblioteket.dk>
> >> > > wrote:
> >> > >
> >> > > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> >> > > > > Two simple questions regarding capacity:
> >> > > > >
> >> > > > > 1.) How many documents can we store in a single core (capacity of
> >> > > > > core storage)?
> >> > > >
> >> > > > There is a hard limit of 2 billion documents.
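> >> > > >
> >> > > > (As I understand it, that ceiling comes from Lucene addressing the
> >> > > > documents in an index with a signed 32-bit int:
> >> > > >
> >> > > >   print(f"{2**31 - 1:,}")   # 2,147,483,647 -- the per-core ceiling
> >> > > >
> >> > > > so "2 billion" really means just under Integer.MAX_VALUE per core.)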
> >> > > >
> >> > > > > 2.) How many cores can we create on a single server (single-node
> >> > > > > cluster)?
> >> > > >
> >> > > > There is no hard limit. Except for 2 billion cores, I guess. But at
> >> > > > this point in time that is a ridiculously high number of cores.
> >> > > >
> >> > > > It is hard to give a suggestion for real-world limits as indexes vary a
> >> > > > lot and the rules of thumb tend to be quite poor when scaling up.
> >> > > >
> >> > > >
> >> > > > http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >> > > >
> >> > > > People generally seem to run into problems with more than 1000
> >> > > > not-too-large cores. If the cores are large, there will probably be
> >> > > > performance problems long before that.
> >> > > >
> >> > > > You will have to build a prototype and test.
> >> > > >
> >> > > > - Toke Eskildsen, State and University Library, Denmark
> >> > > >
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
>
