Thanks, Alessandro. We can attempt to come up with such a blog and I can volunteer for bullets/headings to start with. I also agree that we can't come up with a definitive answer, as mentioned in other places, but we can at least attempt to consolidate all this knowledge in one place. As of now I see a few sources that can be referred to for that:

- https://wiki.apache.org/solr/SolrPerformanceProblems
- http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
- Uwe's article on MMAP
- Erick's and others' valuable posts
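On Alessandro's two questions further down in the thread: yes, as far as I know the documentCache (like filterCache and queryResultCache) lives on the JVM heap, while the memory-mapped segment files live outside the heap in the OS page cache. And my understanding is that MMapDirectory maps every file of a segment, stored fields included, not just the files used for searching. A minimal sketch of how to see this, assuming Lucene/Solr 5.x on a 64-bit JVM (the class name and /path/to/index below are just placeholders):

    import java.nio.file.Paths;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MMapCheck {
      public static void main(String[] args) throws Exception {
        // On 64-bit JVMs FSDirectory.open returns an MMapDirectory, which maps
        // all segment files (postings, doc values, stored fields, ...) into
        // virtual memory. Residency is managed by the OS page cache, not the
        // JVM heap; only Solr's own caches (documentCache etc.) consume heap.
        Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
        System.out.println(dir.getClass().getSimpleName()); // typically MMapDirectory
        dir.close();
      }
    }

So memory mapping and the documentCache are not mutually exclusive: a stored document is read through the mapped file and may then also be cached on-heap by the documentCache. If I have any of this wrong, corrections are welcome.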
On Fri, Dec 11, 2015 at 6:20 AM, Alessandro Benedetti <abenede...@apache.org> wrote:

> Susheel, this is a very good idea.
> I am a little bit busy this period, so I doubt I can contribute with a blog post, but it would be great if anyone has time.
> If not I will add it to my backlog and sooner or later I will do it :)
>
> Furthermore latest observations from Erick are pure gold, and I agree completely.
> I have only a question related this :
>
> 1> "the entire index". What's this? The size on disk?
> > 90% of that size on disk may be stored data which uses very little memory,
> > which is limited by the documentCache in Solr. OTOH, only 10% of the
> > on-disk size might be stored data.
>
> If I am correct the documentCache in Solr is a map that relates the Lucene document ordinal to the stored fields for that document.
> We have control on that and we can assign our preferred values.
> First question :
> 1) Is this using the JVM memory to store this cache ? I assume yes.
> So we need to take care of our JVM memory if we want to store in memory big chunks of the stored index.
>
> 2) MMap index segments are actually only the segments used for searching ?
> Is not the Lucene directory memory mapping the stored segments as well ?
> This was my understanding but maybe I am wrong.
> In the case we first memory map the stored segments and then potentially store them on the Solr cache as well, right ?
>
> Cheers
>
> On 10 December 2015 at 19:43, Susheel Kumar <susheel2...@gmail.com> wrote:
>
> > Like the details here Eric how you broke memory into different parts. I feel if we can combine lot of this knowledge from your various posts, above sizing blog, Solr wiki pages, Uwe article on MMap/heap, consolidate and present in at single place which may help lot of new folks/folks struggling with memory/heap/sizing issues questions etc.
> >
> > Thanks,
> > Susheel
> >
> > On Wed, Dec 9, 2015 at 12:40 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > I object to the question. And the advice. And... ;).
> > >
> > > Practically, IMO guidance that "the entire index should fit into memory"
> > > is misleading, especially for newbies. Let's break it down:
> > >
> > > 1> "the entire index". What's this? The size on disk?
> > > 90% of that size on disk may be stored data which uses very little memory,
> > > which is limited by the documentCache in Solr. OTOH, only 10% of the
> > > on-disk size might be stored data.
> > >
> > > 2> "fit into memory". What memory? Certainly not the JVM as much of the
> > > Lucene-level data is in MMapDirectory which uses the OS memory. So this
> > > _probably_ means JVM + OS memory, and OS memory is shared amongst other
> > > processes as well.
> > >
> > > 3> Solr and Lucene build in-memory structures that aren't reflected in
> > > the index size on disk. I've seen filterCaches for instance that have
> > > been (mis) configured that could grow to 100s of G. This is totally not
> > > reflected in the "index size".
> > >
> > > 4> Try faceting on a text field with lots of unique values. Bad
> > > Practice, but you'll see just how quickly the _query_ can change the
> > > memory requirements.
> > >
> > > 5> Sure, with modern hardware we can create huge JVM heaps...
> > > that hit GC pauses that'll drive performance down, sometimes radically.
> > >
> > > I've seen 350M docs, 200-300 fields (aggregate) fit into 12G of JVM.
> > > I've seen 25M docs (really big ones) strain 48G JVM heaps.
> > >
> > > Jack's approach is what I use; pick a number and test with it.
> > > Here's an approach:
> > >
> > > https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Dec 9, 2015 at 8:54 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >
> > > > Thanks, Jack for quick reply. With Replica / Shard I mean to say on a given machine there may be two/more replicas and all of them may not fit into memory.
> > > >
> > > > On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> > > >
> > > > > Yes, there are nuances to any general rule. It's just a starting point, and your own testing will confirm specific details for your specific app and data. For example, maybe you don't query all fields commonly, so each field-specific index may not require memory or not require it so commonly. And, yes, each app has its own latency requirements. The purpose of a general rule is to generally avoid unhappiness, but if you have an appetite and tolerance for unhappiness, then go for it.
> > > > >
> > > > > Replica vs. shard? They're basically the same - a replica is a copy of a shard.
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > > > >
> > > > > > Hi Jack,
> > > > > >
> > > > > > Just to add, OS Disk Cache will still make query performant even though entire index can't be loaded into memory. How much more latency compare to if index gets completely loaded into memory may vary depending to index size etc. I am trying to clarify this here because lot of folks takes this as a hard guideline (to fit index into memory) and try to come up with hardware/machines (100's of machines) just for the sake of fitting index into memory even though there may not be much load/qps on the cluster. For e.g. this may vary and needs to be tested on case by case basis but a machine with 64GB should still provide good performance (not the best) for 100G index on that machine. Do you agree / any thoughts?
> > > > > >
> > > > > > Same i believe is the case with Replicas, as on a single machine you have replicas which itself may not fit into memory as well along with shard index.
> > > > > >
> > > > > > Thanks,
> > > > > > Susheel
> > > > > >
> > > > > > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> > > > > >
> > > > > > > Generally, you will be resource limited (memory, cpu) rather than by some arbitrary numeric limit (like 2 billion.)
> > > > > > >
> > > > > > > My personal general recommendation is for a practical limit is 100 million documents on a machine/node. Depending on your data model and actual data that number could be higher or lower.
> > > > > > > A proof of concept test will allow you to determine the actual number for your particular use case, but a presumed limit of 100 million is not a bad start.
> > > > > > >
> > > > > > > You should have enough memory to hold the entire index in system memory. If not, your query latency will suffer due to I/O required to constantly re-read portions of the index into memory.
> > > > > > >
> > > > > > > The practical limit for documents is not per core or number of cores but across all cores on the node since it is mostly a memory limit and the available CPU resources for accessing that memory.
> > > > > > >
> > > > > > > -- Jack Krupansky
> > > > > > >
> > > > > > > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> > > > > > >
> > > > > > > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> > > > > > > > > Capacity regarding 2 simple question:
> > > > > > > > >
> > > > > > > > > 1.) How many document we could store in single core(capacity of core storage)
> > > > > > > >
> > > > > > > > There is hard limit of 2 billion documents.
> > > > > > > >
> > > > > > > > > 2.) How many core we could create in a single server(single node cluster)
> > > > > > > >
> > > > > > > > There is no hard limit. Except for 2 billion cores, I guess. But at this point in time that is a ridiculously high number of cores.
> > > > > > > >
> > > > > > > > It is hard to give a suggestion for real-world limits as indexes vary a lot and the rules of thumb tend to be quite poor when scaling up.
> > > > > > > >
> > > > > > > > http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > > > > > > >
> > > > > > > > People generally seems to run into problems with more than 1000 not-too-large cores. If the cores are large, there will probably be performance problems long before that.
> > > > > > > >
> > > > > > > > You will have to build a prototype and test.
> > > > > > > >
> > > > > > > > - Toke Eskildsen, State and University Library, Denmark
>
> --
> --------------------------
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
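One more illustration of Erick's point 3> above, since it could go straight into the blog post: the filterCache is a good example of heap usage that has nothing to do with index size on disk. A back-of-the-envelope sketch, where the document count and cache size are made-up numbers (not a recommendation), and assuming the worst case where each cached filter is a full bitset of one bit per document:

    public class FilterCacheWorstCase {
      public static void main(String[] args) {
        long maxDoc = 350_000_000L; // documents in one core (hypothetical)
        long cacheSize = 512;       // filterCache "size" setting (hypothetical)
        // A non-sparse cached filter is essentially a bitset: one bit per doc.
        long bytesPerEntry = maxDoc / 8;
        long worstCaseBytes = bytesPerEntry * cacheSize;
        System.out.printf("~%d MB per entry, ~%d GB if the cache fills up%n",
            bytesPerEntry / (1024L * 1024),
            worstCaseBytes / (1024L * 1024 * 1024));
      }
    }

That works out to roughly 40 MB per cached filter and around 20 GB of heap if such a cache ever fills, and none of it shows up in the on-disk index size, which is exactly why "fit the entire index into memory" is such a slippery guideline.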