On Thu, 2015-12-10 at 14:43 -0500, Susheel Kumar wrote:
> Like the details here, Eric, how you broke memory into different
> parts. I feel that if we can combine a lot of this knowledge from
> your various posts, the sizing blog above, the Solr wiki pages and
> Uwe's article on MMap/heap, and present it in a single place, it
> may help a lot of new folks and folks struggling with
> memory/heap/sizing questions.

To demonstrate part of the problem:

Say we have an index where each document represents an employee,
with three field definitions: name, company and the dynamic
*_custom. Each company uses 3 dynamic fields with whatever custom
names it sees fit.

Let's say we keep track of 1K companies, each with 10K employees.

The full index is now

  total documents: 10M (1K*10K)
  name: 10M unique values (or fewer, since names are not necessarily unique)
  company: Exactly 1K unique values
  *_custom: 3K unique fields, each with 1K unique values

We do our math-math-thing and arrive at an approximate index size of 5GB
(just an extremely loose guess here). Heap is nothing to speak of for
basic search on this, so let's set that to 1GB. We estimate that a
machine with 8GB of physical RAM is more than fine for this - halving
that to 4GB would probably also work well.
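
Spelling the setup out as a quick back-of-envelope sketch in Python
(the 5GB index size itself is still just the loose guess above, not
something that can be derived from these numbers):

# Toy corpus from the example above
num_companies = 1_000
employees_per_company = 10_000
custom_fields_per_company = 3

total_docs = num_companies * employees_per_company                  # 10,000,000 documents
distinct_custom_fields = num_companies * custom_fields_per_company  # 3,000 *_custom fields
print(total_docs, distinct_custom_fields)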

Say we want to group on company. The "company" field is UnInverted,
so there is an array of 10M pointers into its 1K unique values. That
is about 50MB of overhead. No change needed to the heap allocation.
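
The ~50MB is roughly one 4-byte ordinal per document; a tightly
packed representation could be smaller, so treat this as an
upper-end sketch:

docs = 10_000_000
bytes_per_doc_ordinal = 4               # assumption: a plain int per document
uninverted_bytes = docs * bytes_per_doc_ordinal
print(uninverted_bytes / 2**20, "MiB")  # ~38 MiB, call it ~50MB with overhead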

Say we want to filter on company and cache the filters. Each filter
takes ~1MB, so that is 1000*1MB = 1GB of heap. Okay, so we bump the heap
from 1 to 2GB. The 4GB machine might be a bit small here, depending on
storage, but the 8GB one will work just fine.
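
Where the ~1MB per filter comes from: a cached filter is essentially
a bitset with one bit per document in the index (Solr can store
sparse filters more compactly, so this is the worst case):

docs = 10_000_000
filter_bytes = docs / 8     # one bit per doc ~= 1.25 MB per cached filter
cached_filters = 1_000      # one cached filter per company
print(cached_filters * filter_bytes / 2**30, "GiB")  # ~1.2 GiB of filterCache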

Say each company wants to facet on their custom fields. There are 3K
of those fields, each requiring ~50MB (like the company grouping)
for UnInversion. That is 150GB of heap. Yes, 150GB.
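
Same arithmetic as the grouping case, just multiplied by the number
of distinct dynamic fields - each field gets its own full-size
UnInverted structure, no matter how few documents actually use it:

per_field_bytes = 50 * 2**20   # the ~50MB per UnInverted field from above
custom_fields = 3_000
print(custom_fields * per_field_bytes / 2**30, "GiB")  # ~146 GiB, i.e. the ~150GB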


What about DocValues? Well, if we just use standard String faceting,
we need a map from segment-ordinals to global-ordinals for each
facet field, or in other words a map with 1K entries (one per unique
value) per facet field. Such a map can be represented with < 20
bits/entry (finely packed), so that is ~3KB of heap per field, or
~9MB (3K*3KB) for the full range of custom fields. Quite a different
story from the 150GB of UnInverted heap.
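
Rough ordinal-map arithmetic, ignoring per-segment duplication and
object overhead (so the real number is somewhat higher, but the
order of magnitude holds):

unique_values_per_field = 1_000
bits_per_entry = 20             # finely packed, as above
per_field_bytes = unique_values_per_field * bits_per_entry / 8  # ~2.5 KB
custom_fields = 3_000
print(custom_fields * per_field_bytes / 2**20, "MiB")  # ~7 MiB in total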

Alternatively, say we change the custom fields to three fixed fields
named "custom1", "custom2" & "custom3" and do some name-mapping in
the front-end, so it just looks as if the companies choose the names
themselves. Suddenly there are only 3 larger fields to facet on
instead of 3K small ones. That is 3*50MB of heap required, even
without using DocValues. And we're back to our 4GB machine.
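
The same ~50MB-per-UnInverted-field arithmetic, now with only 3
fields:

consolidated_fields = 3
per_field_bytes = 50 * 2**20   # ~50MB per UnInverted field, as before
print(consolidated_fields * per_field_bytes / 2**20, "MiB")  # 150 MiB instead of ~150GB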

But wait, the index is used quite a lot! 200 concurrent requests.
Each facet request requires a counter array, and each of the three
custom fields now holds 1M unique values (1K for each of the 1K
companies). Those counters take up 4 bytes * 1M = 4MB each, and for
200 concurrent requests that is 800MB + overhead. Better bump the
heap by an extra 1GB.
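
The counter arithmetic, spelled out (assuming a plain int[] counter
per request and that each request only keeps one counter alive at a
time):

values_per_consolidated_field = 1_000 * 1_000       # 1K companies * 1K values each
counter_bytes = 4 * values_per_consolidated_field   # 4 MB per counter array
concurrent_requests = 200
print(concurrent_requests * counter_bytes / 2**20, "MiB")  # ~763 MiB, the ~800MB above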

Except that someone turned on threaded faceting, so the 3 custom
fields are processed at the same time and we had better bump the
heap by another 2GB. Whoops, even the 8GB machine is too small.
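
With threaded faceting the three counters per request are alive at
the same time, so roughly:

concurrent_requests = 200
parallel_fields = 3
counter_bytes = 4 * 1_000_000   # 4 MB per field, as above
print(concurrent_requests * parallel_fields * counter_bytes / 2**30, "GiB")  # ~2.2 GiB of counters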



Not sure I follow all of the above myself, but the moral should be
clear: seemingly innocuous changes to requirements or setup can
easily result in huge changes to hardware needs. If I were to
describe such things well enough for another person (without
previous in-depth knowledge of this field) to make educated guesses,
it would be a massive amount of text with a lot of hard-to-grasp
parts. I have tried twice and scrapped it both times, as it quickly
became apparent that it would be much too unwieldy.

Trying not to be a wet blanket: this could also be because I have my
head too far down in these things. Skipping some details and making
some clearly stated choices up front could work. There is no doubt
that a lot of people ask for estimates, and "we cannot say anything"
is quite a raw deal.


- Toke Eskildsen, State and University Library, Denmark

