On Thu, 2015-12-10 at 14:43 -0500, Susheel Kumar wrote:
> Like the details here Eric how you broke memory into different parts. I
> feel if we can combine lot of this knowledge from your various posts, above
> sizing blog, Solr wiki pages, Uwe article on MMap/heap, consolidate and
> present in at single place which may help lot of new folks/folks struggling
> with memory/heap/sizing issues questions etc.
To demonstrate part of the problem: say we have an index with documents representing employees, with three defined fields: name, company and the dynamic *_custom. Each company uses 3 dynamic fields with custom names as they see fit. Let's say we keep track of 1K companies, each with 10K employees. The full index is now

  total documents: 10M (1K*10K)
  name:     10M unique values (or fewer, since names are not unique)
  company:  exactly 1K unique values
  *_custom: 3K unique fields, each with 1K unique values

We do our math-thing and arrive at an approximate index size of 5GB (just an extremely loose guess here). Heap is nothing to speak of for basic search on this, so let's set that to 1GB. We estimate that a machine with 8GB of physical RAM is more than fine for this; halving that to 4GB would probably also work well.

Say we want to group on company. The company field is UnInverted, so there is an array of 10M pointers to the 1K values. That is about 50MB of overhead. No change needed to the heap allocation.

Say we want to filter on company and cache the filters. Each filter takes ~1MB, so that is 1000*1MB = 1GB of heap. Okay, so we bump the heap from 1 to 2GB. The 4GB machine might be a bit small here, depending on storage, but the 8GB one will work just fine.

Say each company wants to facet on their custom fields. There are 3K of those fields, each requiring ~50MB for UnInversion (just like the company grouping). That is 150GB of heap. Yes, 150GB.

What about DocValues? Well, if we just use standard String faceting, we need a map from segment-ordinals to global-ordinals for each facet field - in other words, a map with 1K entries per facet field. Such a map can be represented with < 20 bits/entry (tightly packed), so that is ~3KB of heap for each field, or 9GB (3K*3KB) for the full range of custom fields. Still way too much for our 8GB machine.

Say we change the custom fields to fixed fields named "custom1", "custom2" & "custom3" and do some name-mapping in the front-end, so it just looks as if the companies choose the names themselves. Suddenly there are only 3 larger fields to facet on instead of 3K small ones. That is 3*50MB of heap required, even without using DocValues. And we're back to our 4GB machine.

But wait, the index is used quite a lot: 200 concurrent requests. Each facet request requires a counter, and for the three custom fields there are 1M unique values (1000 for each company). Those counters take up 4 bytes*1M = 4MB each, and for 200 concurrent requests that is 800MB plus overhead. Better bump the heap by an extra 1GB. Except that someone turned on threaded faceting, so the three custom fields are processed at the same time and we had better bump by 2GB more. Whoops, even the 8GB machine is too small. (A rough sketch of the arithmetic used throughout this example is included below.)

Not sure I follow all of the above myself, but the moral should be clear: seemingly innocuous changes to requirements or setup can easily result in huge changes to hardware requirements. If I were to describe such things thoroughly enough for another person (without previous in-depth knowledge in this field) to make educated guesses, it would be a massive amount of text with a lot of hard-to-grasp parts. I have tried twice and scrapped it both times, as it quickly became apparent that it would be much too unwieldy.

Trying not to be a wet blanket: this could also be because I have my head too far down in these things. Skipping some details and making some clearly stated choices up front could work. There is no doubt that a lot of people ask for estimates, and "we cannot say anything" is quite a raw deal.
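
For reference, here is the back-of-the-envelope arithmetic from the example above as a small Python sketch. The constants (roughly one 4-byte ordinal per document for an UnInverted field, one bit per document for a cached filter, ~50MB per UnInverted field) are my loose assumptions for sizing, not a description of what Solr actually allocates:

    # Back-of-the-envelope heap estimates for the example index above.
    # The constants are rough sizing assumptions, not measured Solr numbers.
    MAX_DOC = 10_000_000          # 1K companies * 10K employees each
    MB = 2 ** 20

    # UnInverting a field (for grouping or faceting) keeps roughly one
    # ordinal per document plus the values themselves; ~4 bytes/doc lands
    # around 40MB, which the text above rounds to ~50MB per field.
    uninverted_per_field_mb = MAX_DOC * 4 / MB

    # A cached filter is essentially a bitset over all documents: 1 bit/doc.
    filter_mb = MAX_DOC / 8 / MB
    filter_cache_gb = 1000 * filter_mb / 1024     # 1K cached company filters

    # Faceting on 3K small custom fields, each UnInverted separately at ~50MB.
    custom_facets_gb = 3000 * 50 / 1024

    print(f"{uninverted_per_field_mb:.0f} MB per UnInverted field")
    print(f"{filter_cache_gb:.1f} GB for 1K cached company filters")
    print(f"{custom_facets_gb:.0f} GB to UnInvert all 3K custom facet fields")

This prints roughly 38MB, 1.2GB and 146GB, which after generous rounding are the ~50MB, ~1GB and ~150GB figures used above.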
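
The concurrency numbers at the end can be sketched the same way. Again, the 4-byte counter slot per unique value and the per-request (and, with threaded faceting, per-field) allocation are assumptions for rough sizing only:

    # Facet counters under load, for the consolidated custom1/2/3 fields.
    # Assumption: one 4-byte counter slot per unique value, allocated per
    # request, and per field when the three fields are faceted in parallel.
    UNIQUE_VALUES = 1_000_000     # 1K companies * 1K custom values each
    GB = 2 ** 30

    counter_bytes = 4 * UNIQUE_VALUES            # ~4MB per counter
    concurrent_requests = 200

    sequential_gb = concurrent_requests * counter_bytes / GB
    threaded_gb = concurrent_requests * 3 * counter_bytes / GB  # threaded faceting

    print(f"{sequential_gb:.1f} GB of facet counters, one field at a time")
    print(f"{threaded_gb:.1f} GB of facet counters with the 3 fields in parallel")

That comes out to roughly 0.7GB and 2.2GB, matching the 800MB and the extra 2GB bump above.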
- Toke Eskildsen, State and University Library, Denmark