If we assume there is no query load, then effectively this boils down to
the most effective way to add a large number of documents to the Solr
index. I've looked through SolrJ, DIH and others -- is the bottom line
across all of them to batch updates and avoid committing for as long as
possible?
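
For concreteness, here is roughly the shape I have in mind -- a minimal
SolrJ sketch, where the URL, collection name, batch size of 1000 and the
field names are all just placeholders:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint/collection; adjust for your setup.
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 1_000_000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Integer.toString(i));
                    doc.addField("title_s", "document " + i);
                    batch.add(doc);
                    if (batch.size() == 1000) { // send batches, not single docs
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                // One explicit commit at the very end; durability in between
                // can come from autoCommit (openSearcher=false) in solrconfig.xml.
                client.commit();
            }
        }
    }

(ConcurrentUpdateSolrClient would presumably pipeline the batches better,
but the batching-and-deferred-commit idea is the same.)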

On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> Oh, there are about a zillion reasons ;).
>
> First of all, most tools that show heap usage also count uncollected
> garbage. So your 10G could actually be much less “live” data. Quick way to
> test is to attach jconsole to the running Solr and hit the button that
> forces a full GC.
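>
> Something like the following should trigger the same full collection from
> the command line (assuming the JDK's jcmd is on the PATH; <solr-pid> is a
> placeholder for the Solr process id):
>
>     jcmd <solr-pid> GC.run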
>
> Another way is to reduce the heap when you start Solr (on a test system,
> of course) until bad stuff happens. If you reduce it to very close to what
> Solr needs, you'll get slower as more and more cycles are spent on GC; if
> you reduce it a little more, you'll get OOMs.
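>
> A rough sketch of that experiment (the -m flag sets both -Xms and -Xmx;
> the sizes are arbitrary starting points, and each line is a separate run):
>
>     bin/solr stop -all
>     bin/solr start -m 4g   # baseline run: everything healthy?
>     bin/solr start -m 2g   # stop, restart smaller, re-test
>     bin/solr start -m 1g   # near the floor: heavy GC, then OOMs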
>
> You can take heap dumps of course to see where all the memory is being
> used, but that’s tricky as it also includes garbage.
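>
> One way to sidestep the garbage problem is to dump only live objects,
> which forces a full GC before the dump (again a JDK tool, jmap, with
> <solr-pid> as a placeholder):
>
>     jmap -dump:live,format=b,file=solr-heap.hprof <solr-pid>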
>
> I’ve seen cache sizes (filterCache in particular) be something that uses
> lots of memory, but that requires queries to be fired. Each filterCache
> entry can take up to roughly maxDoc/8 bytes + overhead….
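>
> To put made-up numbers on that: with maxDoc = 50,000,000, each entry can
> take up to 50,000,000 / 8 = 6.25 MB, so a filterCache sized at 512 entries
> could by itself pin roughly 512 * 6.25 MB ≈ 3.2 GB of heap.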
>
> A classic error is to sort, group, or facet on a docValues=false field.
> Starting with Solr 7.6, you can add an option to fields that throws an
> error if you do this; see: https://issues.apache.org/jira/browse/SOLR-12962.
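>
> If I'm reading that JIRA right, the option is the uninvertible attribute;
> a sketch of the schema change (the field name is made up):
>
>     <field name="category" type="string" indexed="true" stored="true"
>            docValues="false" uninvertible="false"/>
>
> With uninvertible="false", a sort/group/facet on that field fails fast
> instead of silently un-inverting the field onto the heap.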
>
> In short, there’s not enough information until you dive in and test
> bunches of stuff to tell.
>
> Best,
> Erick
>
>
> > On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com>
> > wrote:
> >
> > This makes sense. Any ideas why Lucene/Solr would use a 10G heap for a
> > 20G index? My hypothesis was that segment merging was trying to read it
> > all, but if that's not the case I am out of ideas. The one caveat is that
> > we are adding documents quickly (~1G an hour), but if Lucene writes 100M
> > segments and does a streaming merge, that shouldn't matter?
> >
> > On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org>
> > wrote:
> >
> >>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com>
> >>> wrote:
> >>>
> >>> 2. Merging segments - does Solr load the entire segment into memory or
> >>> chunks of it? If the latter, how large are these chunks?
> >>
> >> No, it does not read the entire segment into memory.
> >>
> >> A fundamental part of the Lucene design is streaming posting lists into
> >> memory and processing them sequentially. The same amount of memory is
> >> needed for small or large segments. Each posting list is in document-id
> >> order. The merge is a merge of sorted lists, writing a new posting list
> >> in document-id order.
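> >>
> >> A toy sketch of that shape (not Lucene's actual code; just a streaming
> >> merge of doc-id-sorted lists, holding one cursor per list):
> >>
> >>     import java.util.ArrayList;
> >>     import java.util.Iterator;
> >>     import java.util.List;
> >>     import java.util.PriorityQueue;
> >>
> >>     public class StreamingMerge {
> >>         // Merge doc-id-sorted "posting lists"; memory stays O(number
> >>         // of lists), independent of how long each list is.
> >>         static List<Integer> merge(List<Iterator<Integer>> lists) {
> >>             PriorityQueue<int[]> heap = // entries are {docId, listIndex}
> >>                 new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
> >>             for (int i = 0; i < lists.size(); i++) {
> >>                 if (lists.get(i).hasNext()) {
> >>                     heap.add(new int[] {lists.get(i).next(), i});
> >>                 }
> >>             }
> >>             List<Integer> merged = new ArrayList<>(); // the "new segment"
> >>             while (!heap.isEmpty()) {
> >>                 int[] top = heap.poll();
> >>                 merged.add(top[0]);
> >>                 Iterator<Integer> src = lists.get(top[1]);
> >>                 if (src.hasNext()) {
> >>                     heap.add(new int[] {src.next(), top[1]});
> >>                 }
> >>             }
> >>             return merged;
> >>         }
> >>     }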
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
>
>
