Oh, there are about a zillion reasons ;).

First of all, most tools that show heap usage also count uncollected garbage, 
so your 10G could actually be much less “live” data. A quick way to test is to 
attach jconsole to the running Solr, hit the button that forces a full GC, and 
see what the heap drops to.
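
If you’d rather not use the GUI, the stock JDK tools can do roughly the same 
thing from the command line (replace <pid> with the Solr process id):

    jcmd <pid> GC.run              # ask the JVM to run a full GC
    jmap -histo:live <pid>         # forces a GC, then prints only live objects

Whatever is left after the collection is a much better approximation of what 
Solr really needs.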

Another way is to reduce the heap when you start Solr (on a test system of 
course) until bad stuff happens. If you reduce it to very close to what Solr 
actually needs, things get slower as more and more cycles are spent on GC; 
reduce it a little more and you’ll start getting OOMs.
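
For instance (the numbers here are purely for illustration), restart a test 
node with a smaller and smaller heap:

    bin/solr stop -all
    bin/solr start -m 4g           # -m sets both the min and max heap

and keep stepping the -m value down until response times degrade or you hit 
OOMs; the real working set is somewhere just above that point.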

You can take heap dumps, of course, to see where all the memory is being used, 
but that’s tricky as the dump also includes garbage.
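
One way to keep most of the garbage out of the dump (again, <pid> and the file 
path are placeholders) is to dump live objects only, which triggers a full GC 
first:

    jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <pid>

and then open the .hprof file in something like Eclipse MAT to see what’s 
actually holding on to the memory.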

I’ve seen cache sizes (the filterCache in particular) use lots of memory, but 
that requires queries to be fired. Each filterCache entry can take up to 
roughly maxDoc/8 bytes plus overhead….
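
To put rough (made-up) numbers on that: with maxDoc around 100 million, a 
single worst-case entry is about 100,000,000 / 8 = 12.5M bytes, so a 
filterCache with size=512 could chew up on the order of 6G of heap all by 
itself if it fills with entries like that.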

A classic error is to sort, group, or facet on a docValues=false field. Starting 
with Solr 7.6, you can add an option to fields to throw an error if you do 
this; see: https://issues.apache.org/jira/browse/SOLR-12962.
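
If I’m remembering the attribute correctly it’s the uninvertible flag, so a 
field declaration along these lines (the field name is made up) fails fast 
instead of silently building a huge on-heap structure:

    <field name="manu" type="string" indexed="true" stored="true"
           docValues="false" uninvertible="false"/>

Double-check the JIRA and the ref guide for the exact spelling before relying 
on it.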

In short, there’s not enough information to tell until you dive in and test a 
bunch of things.

Best,
Erick


> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com> wrote:
> 
> This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
> index. My hypothesis was merging segments was trying to read it all but if
> that's not the case I am out of ideas. The one caveat is we are trying to
> add the documents quickly (~1g an hour) but if lucene does write 100m
> segments and does streaming merge it shouldn't matter?
> 
> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org>
> wrote:
> 
>>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com>
>> wrote:
>>> 
>>> 2. Merging segments - does solr load the entire segment in memory or
>> chunks
>>> of it? if later how large are these chunks
>> 
>> No, it does not read the entire segment into memory.
>> 
>> A fundamental part of the Lucene design is streaming posting lists into
>> memory and processing them sequentially. The same amount of memory is
>> needed for small or large segments. Each posting list is in document-id
>> order. The merge is a merge of sorted lists, writing a new posting list in
>> document-id order.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
