I object to the question. And the advice. And... ;).

Practically, IMO guidance that "the entire index should
fit into memory" is misleading, especially for newbies.
Let's break it down:

1>  "the entire index". What's this? The size on disk?
90% of that size on disk may be stored data which
uses very little memory, which is limited by the
documentCache in Solr. OTOH, only 10% of the on-disk
size might be stored data.

2> "fit into memory". What memory? Certainly not
the JVM as much of the Lucene-level data is in
MMapDirectory which uses the OS memory. So
this _probably_ means JVM + OS memory, and OS
memory is shared amongst other processes as well.

3> Solr and Lucene build in-memory structures that
aren't reflected in the index size on disk. I've seen
filterCaches, for instance, that were (mis)configured
such that they could grow to 100s of GB. None of that
shows up in the "index size". (Some back-of-the-envelope
arithmetic on this follows further down.)

4> Try faceting on a text field with lots of unique
values. Bad practice, but you'll see just how quickly
the _query_ can change the memory requirements.

5> Sure, with modern hardware we can create huge JVM
heaps... that hit GC pauses that'll drive performance
down, sometimes radically.
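
FWIW, points 1> and 2> are easy to see for yourself. Here's a minimal
sketch (assuming Lucene 5.x or later on the classpath; the index path is
a placeholder) that opens an index through MMapDirectory and sums the
on-disk file sizes by extension, so you can see how much of the "index
size" is stored-field data (.fdt/.fdx) versus the structures that are
actually searched. Note that nothing significant lands on the JVM heap
here; the mapped files live in OS memory:

// Minimal sketch; "/path/to/index" is a placeholder for a real index dir.
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.MMapDirectory;

public class IndexSizeBreakdown {
    public static void main(String[] args) throws Exception {
        try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"))) {
            // Sum on-disk sizes by file extension, e.g. ".fdt" = stored fields,
            // ".tim"/".tip"/".doc"/".pos" = term dictionary and postings.
            Map<String, Long> bytesByExt = new TreeMap<>();
            for (String file : dir.listAll()) {
                int dot = file.lastIndexOf('.');
                String ext = dot < 0 ? file : file.substring(dot);
                bytesByExt.merge(ext, dir.fileLength(file), Long::sum);
            }
            bytesByExt.forEach((ext, bytes) ->
                System.out.printf("%-12s %,15d bytes%n", ext, bytes));

            // Opening a reader memory-maps the files; the data is cached by the
            // OS page cache, not the JVM heap, even for a very large index.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                System.out.println("maxDoc = " + reader.maxDoc());
            }
        }
    }
}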

I've seen 350M docs, 200-300 fields (aggregate) fit into 12G
of JVM. I've seen 25M docs (really big ones) strain 48G
JVM heaps.
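
To put rough numbers on 3>: in the worst case each filterCache entry is a
bitset over all docs in the core, i.e. about maxDoc/8 bytes. A toy
calculation (the numbers are made up, loosely based on the 350M example
above, and the cache size is deliberately misconfigured):

public class FilterCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 350_000_000L;      // docs on the node (illustrative)
        long bytesPerEntry = maxDoc / 8; // ~44 MB per cached filter, worst case
        long cacheSize = 10_000;         // a (mis)configured filterCache size
        System.out.printf("per entry: %,d bytes, cache: %,d bytes (~%d GB)%n",
            bytesPerEntry, bytesPerEntry * cacheSize,
            bytesPerEntry * cacheSize / (1024L * 1024 * 1024));
    }
}

That's how a filterCache alone can dwarf the "index size on disk".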

Jack's approach is what I use; pick a number and test with it.
Here's an approach:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
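
If you want a skeleton to start from, something like this (SolrJ; the
URL, collection name, and queries are placeholders, and older SolrJ
versions use new HttpSolrClient(url) instead of the Builder) will at
least put latency numbers in front of you while you grow the index
toward your target size:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LatencyProbe {
    public static void main(String[] args) throws Exception {
        // Stand-ins for queries taken from your real traffic.
        String[] sampleQueries = {"*:*", "title:solr", "body:memory AND body:index"};
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/test").build()) {
            for (String q : sampleQueries) {
                SolrQuery query = new SolrQuery(q);
                query.setRows(10);
                long start = System.nanoTime();
                QueryResponse rsp = client.query(query);
                long wallMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("q=%-30s hits=%,d QTime=%dms wall=%dms%n",
                    q, rsp.getResults().getNumFound(), rsp.getQTime(), wallMs);
            }
        }
    }
}

Watch QTime, wall-clock time, and GC behavior together; the point where
they start degrading is "your number" for that hardware.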

Best,
Erick

On Wed, Dec 9, 2015 at 8:54 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> Thanks, Jack, for the quick reply.  By replica / shard I mean that on a given
> machine there may be two or more replicas, and all of them together may not
> fit into memory.
>
> On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> Yes, there are nuances to any general rule. It's just a starting point, and
>> your own testing will confirm specific details for your specific app and
>> data. For example, maybe you don't query all fields commonly, so each
>> field-specific index may not require memory or not require it so commonly.
>> And, yes, each app has its own latency requirements. The purpose of a
>> general rule is to generally avoid unhappiness, but if you have an appetite
>> and tolerance for unhappiness, then go for it.
>>
>> Replica vs. shard? They're basically the same - a replica is a copy of a
>> shard.
>>
>> -- Jack Krupansky
>>
>> On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar <susheel2...@gmail.com>
>> wrote:
>>
>> > Hi Jack,
>> >
>> > Just to add: the OS disk cache will still keep queries performant even
>> > though the entire index can't be loaded into memory. How much extra
>> > latency there is, compared to the index being completely in memory, will
>> > vary with index size etc.  I am trying to clarify this here because a lot
>> > of folks take this as a hard guideline (fit the index into memory) and
>> > try to come up with hardware/machines (100's of machines) just for the
>> > sake of fitting the index into memory, even though there may not be much
>> > load/qps on the cluster. This varies and needs to be tested on a
>> > case-by-case basis, but for example a machine with 64GB should still
>> > provide good performance (not the best) for a 100G index on that machine.
>> > Do you agree / any thoughts?
>> >
>> > Same, I believe, is the case with replicas: on a single machine you may
>> > have several replicas which, together with the shard index, may not fit
>> > into memory either.
>> >
>> > Thanks,
>> > Susheel
>> >
>> > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <jack.krupan...@gmail.com>
>> > wrote:
>> >
>> > > Generally, you will be resource limited (memory, cpu) rather than by
>> > > some arbitrary numeric limit (like 2 billion).
>> > >
>> > > My personal recommendation for a practical limit is 100 million
>> > > documents on a machine/node. Depending on your data model and actual
>> > > data, that number could be higher or lower. A proof-of-concept test
>> > > will allow you to determine the actual number for your particular use
>> > > case, but a presumed limit of 100 million is not a bad start.
>> > >
>> > > You should have enough memory to hold the entire index in system
>> > > memory. If not, your query latency will suffer due to the I/O required
>> > > to constantly re-read portions of the index into memory.
>> > >
>> > > The practical limit for documents is not per core or number of cores
>> > > but across all cores on the node, since it is mostly a limit on memory
>> > > and the available CPU resources for accessing that memory.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen <t...@statsbiblioteket.dk>
>> > > wrote:
>> > >
>> > > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
>> > > > > Two simple questions regarding capacity:
>> > > > >
>> > > > > 1.) How many documents can we store in a single core (capacity of
>> > > > > core storage)?
>> > > >
>> > > > There is a hard limit of 2 billion documents.
>> > > >
>> > > > > 2.) How many cores can we create on a single server (single-node
>> > > > > cluster)?
>> > > >
>> > > > There is no hard limit. Except for 2 billion cores, I guess. But at
>> > > > this point in time that is a ridiculously high number of cores.
>> > > >
>> > > > It is hard to give a suggestion for real-world limits, as indexes
>> > > > vary a lot and the rules of thumb tend to be quite poor when scaling up.
>> > > >
>> > > >
>> > > > http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>> > > >
>> > > > People generally seem to run into problems with more than 1000
>> > > > not-too-large cores. If the cores are large, there will probably be
>> > > > performance problems long before that.
>> > > >
>> > > > You will have to build a prototype and test.
>> > > >
>> > > > - Toke Eskildsen, State and University Library, Denmark
>> > > >
>> > > >
>> > > >
>> > >
>> >
>>
