> I've looked through SolrJ, DIH and others -- is the bottomline
> across all of them to "batch updates" and not commit as long as possible?

Of course it's more complicated than that ;) … But to start: yes, I urge you to batch. Here are some stats: https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Note that at about 100 docs/batch you hit diminishing returns. _However_, that test was run on a single-shard collection, so if you have 10 shards you'd have to send 1,000 docs/batch. I wouldn't sweat that number much; just don't send one document at a time. And there are the usual gotchas if your documents are 1 MB vs. 1 KB.
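Concretely, batching with SolrJ looks something like the sketch below. The Solr URL, collection name, field names, batch size, and document source are all placeholders for illustration, not details from this thread:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // ~100 docs/batch/shard was the diminishing-returns point in the
    // benchmark linked above; 1,000 here is just an illustrative number.
    final int BATCH_SIZE = 1000;

    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycollection").build()) {
      List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
      for (int i = 0; i < 100_000; i++) { // stand-in for your real doc source
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("title_txt", "document " + i);
        batch.add(doc);
        if (batch.size() >= BATCH_SIZE) {
          client.add(batch); // one request for the whole batch, no commit
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add(batch); // flush the remainder
      }
      // No explicit commit here: let the server-side autoCommit and
      // autoSoftCommit settings (sketched below) handle durability
      // and visibility.
    }
  }
}

The same batched add works with CloudSolrClient (SolrCloud) or ConcurrentUpdateSolrClient (bulk one-way indexing); the point is one request per batch, not one per document.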
About committing: no, don't hold off as long as possible. When you commit, segments are merged. _However_, the default 100M internal RAM buffer means that segments are written anyway once you accumulate 100M of index data, even if you never hit a commit point, and merges happen anyway. So you won't save anything on merging by holding off commits, and you'll incur penalties. Here's more than you want to know about commits: https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

But some key take-aways…

If for some reason Solr terminates abnormally, the documents accumulated since the last hard commit are replayed from the transaction log. So say you don't commit for an hour of furious indexing and someone does a "kill -9": when you restart Solr, it'll try to re-index all the docs from the last hour.

Hard commits with openSearcher=false aren't all that expensive. I usually set mine to a minute and forget about it.

Transaction logs hold a window, _not_ the entire set of operations since time began. When you do a hard commit, the current tlog is closed, a new one is opened, and tlogs that are "too old" are deleted. If you never commit, you have a huge transaction log to no good purpose.

Also, while indexing, in order to accommodate "Real Time Get", a pointer is kept in memory for every doc indexed since the last searcher was opened. So if you _never_ open a new searcher, that internal structure can get quite large. So in bulk-indexing operations, I suggest you open a searcher every so often. Opening a new searcher isn't terribly expensive if you have no autowarming going on (autowarming being what you configure in solrconfig.xml on filterCache, queryResultCache, etc.).

So if I were going to recommend settings, they'd be something like this: do a hard commit with openSearcher=false every 60 seconds, and a soft commit every 5 minutes. I'd actually be surprised if you were able to measure a difference between those settings and just a hard commit with openSearcher=true every 60 seconds with soft commit at -1 (never)…
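Expressed in solrconfig.xml, that recommendation would look roughly like the sketch below. It goes in the <updateHandler> section; only the two intervals come from the numbers above, the rest is stock configuration:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 60 seconds: flushes segments and rolls the tlog,
       but openSearcher=false means no searcher churn. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit every 5 minutes: opens a new searcher, making documents
       visible and releasing the Real Time Get bookkeeping. -->
  <autoSoftCommit>
    <maxTime>300000</maxTime>
  </autoSoftCommit>
</updateHandler>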
Best,
Erick

> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com> wrote:
>
> If we assume there is no query load then effectively this boils down to
> the most effective way of adding a large number of documents to the solr
> index. I've looked through SolrJ, DIH and others -- is the bottomline
> across all of them to "batch updates" and not commit as long as possible?
>
> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Oh, there are about a zillion reasons ;).
>>
>> First of all, most tools that show heap usage also count uncollected
>> garbage, so your 10G could actually be much less "live" data. A quick
>> way to test is to attach jconsole to the running Solr and hit the button
>> that forces a full GC.
>>
>> Another way is to reduce the heap you start Solr with (on a test system,
>> of course) until bad stuff happens. If you reduce it to very close to
>> what Solr needs, things get slower as more and more cycles are spent on
>> GC; if you reduce it a little more, you'll get OOMs.
>>
>> You can take heap dumps, of course, to see where all the memory is being
>> used, but that's tricky as the dump also includes garbage.
>>
>> I've seen cache sizes (filterCache in particular) use lots of memory,
>> but that requires queries to be fired. Each filterCache entry can take
>> up to roughly maxDoc/8 bytes + overhead…
>>
>> A classic error is to sort, group or facet on a docValues=false field.
>> Starting with Solr 7.6, you can add an option to fields to throw an
>> error if you do this; see https://issues.apache.org/jira/browse/SOLR-12962.
>>
>> In short, there's not enough information to tell until you dive in and
>> test bunches of stuff.
>>
>> Best,
>> Erick
>>
>>> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com> wrote:
>>>
>>> This makes sense. Any ideas why lucene/solr will use 10g heap for a 20g
>>> index? My hypothesis was that merging segments was trying to read it
>>> all, but if that's not the case I am out of ideas. The one caveat is we
>>> are trying to add the documents quickly (~1g an hour), but if lucene
>>> writes 100m segments and does a streaming merge it shouldn't matter?
>>>
>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org>
>>> wrote:
>>>
>>>>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> 2. Merging segments - does solr load the entire segment in memory or
>>>>> chunks of it? If the latter, how large are these chunks?
>>>>
>>>> No, it does not read the entire segment into memory.
>>>>
>>>> A fundamental part of the Lucene design is streaming posting lists
>>>> into memory and processing them sequentially. The same amount of
>>>> memory is needed for small or large segments. Each posting list is in
>>>> document-id order. The merge is a merge of sorted lists, writing a new
>>>> posting list in document-id order.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
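As a back-of-the-envelope companion to the filterCache estimate in the quoted thread (up to roughly maxDoc/8 bytes per entry), here is a small sketch; the document count and cache size are invented for illustration:

public class FilterCacheEstimate {
  public static void main(String[] args) {
    long maxDoc = 20_000_000L; // hypothetical index with 20M documents
    int cacheSize = 512;       // hypothetical filterCache size in solrconfig.xml

    // Each entry can be a bitset with one bit per document in the index.
    long bytesPerEntry = maxDoc / 8;                  // 2.5 MB per entry
    long fullCacheBytes = bytesPerEntry * cacheSize;  // ~1.28 GB if the cache fills

    System.out.printf("per entry: %,d bytes; full cache: %,d bytes%n",
        bytesPerEntry, fullCacheBytes);
  }
}

At that scale a fully populated filterCache alone accounts for over a gigabyte of heap, which is one way a modest index can end up with a surprisingly large "in use" heap.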