I need to update that; I didn’t understand the bits about retaining internal memory structures at the time.
> On Jun 4, 2019, at 2:10 AM, John Davis <johndavis925...@gmail.com> wrote:
>
> Erick - These conflict, what's changed?
>
> So if I were going to recommend settings, they’d be something like this:
> Do a hard commit with openSearcher=false every 60 seconds.
> Do a soft commit every 5 minutes.
>
> vs
>
> Index-heavy, Query-light
> Set your soft commit interval quite long, up to the maximum latency you
> can stand for documents to be visible. This could be just a couple of
> minutes or much longer. Maybe even hours, with the capability of issuing
> a hard commit (openSearcher=true) or soft commit on demand.
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <erickerick...@gmail.com> wrote:
>
>>> I've looked through SolrJ, DIH and others -- is the bottom line
>>> across all of them to "batch updates" and not commit as long as possible?
>>
>> Of course it’s more complicated than that ;)…
>>
>> But to start, yes, I urge you to batch. Here’s some stats:
>> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
>>
>> Note that at about 100 docs/batch you hit diminishing returns. _However_,
>> that test was run on a single-shard collection, so if you have 10 shards
>> you’d have to send 1,000 docs/batch. I wouldn’t sweat that number much,
>> just don’t send one at a time. And there are the usual gotchas if your
>> documents are 1M vs. 1K.
>>
>> About committing: no, don’t hold off as long as possible. When you
>> commit, segments are merged. _However_, the default 100M internal buffer
>> size means that segments are written anyway once you have 100M of index
>> data, even if you don’t hit a commit point, and merges happen anyway. So
>> you won’t save anything on merging by holding off commits, and you’ll
>> incur penalties. Here’s more than you want to know about commits:
>>
>> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> But some key take-aways… If for some reason Solr terminates abnormally,
>> the documents accumulated since the last hard commit are replayed. So
>> say you don’t commit for an hour of furious indexing and someone does a
>> “kill -9”. When you restart Solr, it’ll try to re-index all the docs
>> from the last hour. Hard commits with openSearcher=false aren’t all that
>> expensive. I usually set mine for a minute and forget about it.
>>
>> Transaction logs hold a window, _not_ the entire set of operations since
>> time began. When you do a hard commit, the current tlog is closed, a new
>> one is opened, and ones that are “too old” are deleted. If you never
>> commit, you have a huge transaction log to no good purpose.
>>
>> Also, while indexing, in order to accommodate “Real Time Get”, all the
>> docs indexed since the last searcher was opened have a pointer kept in
>> memory. So if you _never_ open a new searcher, that internal structure
>> can get quite large. So in bulk-indexing operations, I suggest you open
>> a searcher every so often.
>>
>> Opening a new searcher isn’t terribly expensive if you have no
>> autowarming going on. Autowarming is defined in solrconfig.xml on the
>> filterCache, queryResultCache, etc.
>>
>> So if I were going to recommend settings, they’d be something like this:
>> Do a hard commit with openSearcher=false every 60 seconds.
>> Do a soft commit every 5 minutes.
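To make the batching advice concrete, here is roughly what a bulk load looks like from SolrJ. It is only a sketch: the URL, collection name, field names and loop are placeholders, and the batch size of 1,000 just follows the 10-shard rule of thumb above. Note that there is deliberately no commit() call from the client; it relies on the quoted settings being in solrconfig.xml, i.e. autoCommit maxTime=60000 with openSearcher=false and autoSoftCommit maxTime=300000.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // Placeholder URL and collection name.
    try (SolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 1_000_000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("title_t", "document " + i);
        batch.add(doc);
        if (batch.size() >= 1000) {          // ~100 docs/shard on 10 shards
          client.add("collection1", batch);  // send the batch, no commit
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add("collection1", batch);    // flush the last partial batch
      }
      // Hard commits (openSearcher=false, 60s) and soft commits (5 min) are
      // left to the server's autoCommit/autoSoftCommit settings.
    }
  }
}

The only parts that really matter are that each add() carries hundreds of documents and that the client never issues its own commit.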
>>
>> I’d actually be surprised if you were able to measure differences
>> between those settings and just a hard commit with openSearcher=true
>> every 60 seconds and a soft commit at -1 (never)…
>>
>> Best,
>> Erick
>>
>>> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com> wrote:
>>>
>>> If we assume there is no query load, then effectively this boils down
>>> to the most effective way of adding a large number of documents to the
>>> Solr index. I've looked through SolrJ, DIH and others -- is the bottom
>>> line across all of them to "batch updates" and not commit as long as
>>> possible?
>>>
>>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>>> Oh, there are about a zillion reasons ;).
>>>>
>>>> First of all, most tools that show heap usage also count uncollected
>>>> garbage, so your 10G could actually be much less “live” data. A quick
>>>> way to test is to attach jconsole to the running Solr and hit the
>>>> button that forces a full GC.
>>>>
>>>> Another way is to reduce your heap when you start Solr (on a test
>>>> system, of course) until bad stuff happens. If you reduce it to very
>>>> close to what Solr needs, you’ll get slower as more and more cycles
>>>> are spent on GC; if you reduce it a little more, you’ll get OOMs.
>>>>
>>>> You can take heap dumps, of course, to see where all the memory is
>>>> being used, but that’s tricky as it also includes garbage.
>>>>
>>>> I’ve seen cache sizes (filterCache in particular) be something that
>>>> uses lots of memory, but that requires queries to be fired. Each
>>>> filterCache entry can take up to roughly maxDoc/8 bytes + overhead….
>>>>
>>>> A classic error is to sort, group or facet on a docValues=false field.
>>>> Starting with Solr 7.6, you can add an option to fields to throw an
>>>> error if you do this, see:
>>>> https://issues.apache.org/jira/browse/SOLR-12962
>>>>
>>>> In short, there’s not enough information to tell until you dive in
>>>> and test a bunch of things.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com> wrote:
>>>>>
>>>>> This makes sense. Any ideas why lucene/solr will use 10g heap for a
>>>>> 20g index? My hypothesis was that merging segments was trying to read
>>>>> it all, but if that's not the case I am out of ideas. The one caveat
>>>>> is we are trying to add the documents quickly (~1g an hour), but if
>>>>> lucene does write 100m segments and does a streaming merge it
>>>>> shouldn't matter?
>>>>>
>>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>
>>>>>>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com> wrote:
>>>>>>>
>>>>>>> 2. Merging segments - does solr load the entire segment in memory
>>>>>>> or chunks of it? If the latter, how large are these chunks?
>>>>>>
>>>>>> No, it does not read the entire segment into memory.
>>>>>>
>>>>>> A fundamental part of the Lucene design is streaming posting lists
>>>>>> into memory and processing them sequentially. The same amount of
>>>>>> memory is needed for small or large segments. Each posting list is
>>>>>> in document-id order. The merge is a merge of sorted lists, writing
>>>>>> a new posting list in document-id order.
>>>>>>
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>>>>> http://observer.wunderwood.org/ (my blog)
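P.S. To put the filterCache number above in perspective with some made-up but plausible figures: with maxDoc around 100 million, each filterCache entry is a bitset of roughly 100,000,000 / 8 bytes, call it 12.5 MB, so a filterCache allowed to grow to a few hundred entries can account for several GB of heap all by itself. Plug in your own maxDoc and cache size to see whether that explains the 10G.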