If we assume there is no query load, then effectively this boils down to the most efficient way of adding a large number of documents to the Solr index. I've looked through SolrJ, DIH and others -- is the bottom line across all of them to batch updates and to hold off committing for as long as possible?
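For concreteness, here is roughly the kind of indexing loop I have in mind with SolrJ. This is just a sketch -- the URL, collection name, field names, batch size and document count below are made up, not our real setup:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and collection, for illustration only.
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_s", "document " + i);
            batch.add(doc);

            // Ship documents in batches of 1000 per request, and do NOT
            // commit here -- leave committing to autoCommit or the end.
            if (batch.size() == 1000) {
                client.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }
        // One explicit commit at the very end of the run.
        client.commit();
        client.close();
    }
}

Is that roughly the right pattern, or is there a meaningfully faster path (e.g. ConcurrentUpdateSolrClient) when query load doesn't matter?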
On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com> wrote:

> Oh, there are about a zillion reasons ;).
>
> First of all, most tools that show heap usage also count uncollected
> garbage. So your 10G could actually be much less “live” data. Quick way to
> test is to attach jconsole to the running Solr and hit the button that
> forces a full GC.
>
> Another way is to reduce your heap when you start Solr (on a test system
> of course) until bad stuff happens, if you reduce it to very close to what
> Solr needs, you’ll get slower as more and more cycles are spent on GC, if
> you reduce it a little more you’ll get OOMs.
>
> You can take heap dumps of course to see where all the memory is being
> used, but that’s tricky as it also includes garbage.
>
> I’ve seen cache sizes (filterCache in particular) be something that uses
> lots of memory, but that requires queries to be fired. Each filterCache
> entry can take up to roughly maxDoc/8 bytes + overhead….
>
> A classic error is to sort, group or facet on a docValues=false field.
> Starting with Solr 7.6, you can add an option to fields to throw an error
> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.
>
> In short, there’s not enough information until you dive in and test
> bunches of stuff to tell.
>
> Best,
> Erick
>
> > On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com> wrote:
> >
> > This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
> > index. My hypothesis was merging segments was trying to read it all but if
> > that's not the case I am out of ideas. The one caveat is we are trying to
> > add the documents quickly (~1g an hour) but if lucene does write 100m
> > segments and does streaming merge it shouldn't matter?
> >
> > On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org>
> > wrote:
> >
> >>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com> wrote:
> >>>
> >>> 2. Merging segments - does solr load the entire segment in memory or
> >>> chunks of it? if later how large are these chunks
> >>
> >> No, it does not read the entire segment into memory.
> >>
> >> A fundamental part of the Lucene design is streaming posting lists into
> >> memory and processing them sequentially. The same amount of memory is
> >> needed for small or large segments. Each posting list is in document-id
> >> order. The merge is a merge of sorted lists, writing a new posting list in
> >> document-id order.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/ (my blog)
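And one quick back-of-the-envelope check on Erick's filterCache point above, bottom-posted here -- the document count and cache size are invented numbers for illustration, not measurements from our index:

public class FilterCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 20_000_000L;        // hypothetical doc count
        long cacheEntries = 256;          // hypothetical filterCache size limit
        long bytesPerEntry = maxDoc / 8;  // one bit per doc, per the maxDoc/8 estimate above
        long totalMb = bytesPerEntry * cacheEntries / (1024 * 1024);
        System.out.println("filterCache worst case ~" + totalMb + " MB + overhead");
    }
}

Even with invented numbers the worst case runs to several hundred MB, so cache sizes (and docValues on sort/facet fields) look worth ruling out before blaming segment merging.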