> I've looked through SolrJ, DIH and others -- is the bottomline
> across all of them to "batch updates" and not commit as long as possible?

Of course it’s more complicated than that ;)….

But to start, yes, I urge you to batch. Here’s some stats:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Note that at about 100 docs/batch you hit diminishing returns. _However_,
that test was run on a single shard collection, so if you have 10 shards you’d
have to send 1,000 docs/batch. I wouldn’t sweat that number much, just don’t
send one at a time. And there are the usual gotchas if your documents are
1M vs. 1K in size.
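As a rough sketch, the batching pattern is just fixed-size partitioning (plain Java here, no SolrJ dependency; the `BatchIndexer` name and batch size of 100 are illustrative — with SolrJ you would hand each batch to `SolrClient.add(Collection<SolrInputDocument>)` instead of adding docs one at a time):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {
    // Split a document list into fixed-size batches. With SolrJ you would
    // call client.add(batch) once per batch rather than client.add(doc)
    // once per document.
    public static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 250; i++) docs.add(i);
        // 250 docs at 100/batch -> 3 batches (100 + 100 + 50)
        System.out.println(partition(docs, 100).size());
    }
}
```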

About committing: no, don’t hold off as long as possible. When you commit,
segments are merged. _However_, the default 100M internal indexing buffer
means that once you accumulate 100M of index data, segments are written
even if you never hit a commit point, and merges happen anyway. So you won’t
save anything on merging by holding off commits.
And you’ll incur penalties. Here’s more than you want to know about 
commits: 
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
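For reference, the internal buffer mentioned above is the `ramBufferSizeMB` setting in solrconfig.xml, and 100 MB is its default:

```xml
<!-- solrconfig.xml, inside <indexConfig>; 100 MB is the default -->
<ramBufferSizeMB>100</ramBufferSizeMB>
```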

But some key take-aways… If for some reason Solr abnormally 
terminates, the accumulated documents since the last hard
commit are replayed. So say you don’t commit for an hour of
furious indexing and someone does a “kill -9”. When you restart
Solr it’ll try to re-index all the docs for the last hour. Hard commits
with openSearcher=false aren’t all that expensive. I usually set mine
for a minute and forget about it.

Transaction logs hold a window, _not_ the entire set of operations
since time began. When you do a hard commit, the current tlog is
closed, a new one is opened, and tlogs that are “too old” are deleted. If
you never commit, you keep a huge transaction log to no good purpose.

Also, while indexing, in order to accommodate “Real Time Get”, all
the docs indexed since the last searcher was opened have a pointer
kept in memory. So if you _never_ open a new searcher, that internal
structure can get quite large. So in bulk-indexing operations, I
suggest you open a searcher every so often.

Opening a new searcher isn’t terribly expensive if you have no autowarming
going on. Autowarming is configured in solrconfig.xml via the autowarmCount
setting on filterCache, queryResultCache, etc.

So if I were going to recommend settings, they’d be something like this:
Do a hard commit with openSearcher=false every 60 seconds.
Do a soft commit every 5 minutes.
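In solrconfig.xml terms, those two recommendations would look something like this (times are in milliseconds):

```xml
<!-- Hard commit every 60s without opening a searcher;
     soft commit (new searcher) every 5 minutes -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>300000</maxTime>
</autoSoftCommit>
```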

I’d actually be surprised if you were able to measure differences between
those settings and just hard commit with openSearcher=true every 60 seconds and 
soft commit at -1 (never)…

Best,
Erick

> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com> wrote:
> 
> If we assume there is no query load then effectively this boils down to
> most effective way for adding a large number of documents to the solr
> index. I've looked through SolrJ, DIH and others -- is the bottomline
> across all of them to "batch updates" and not commit as long as possible?
> 
> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> Oh, there are about a zillion reasons ;).
>> 
>> First of all, most tools that show heap usage also count uncollected
>> garbage. So your 10G could actually be much less “live” data. Quick way to
>> test is to attach jconsole to the running Solr and hit the button that
>> forces a full GC.
>> 
>> Another way is to reduce the heap when you start Solr (on a test system,
>> of course) until bad stuff happens. If you reduce it to very close to what
>> Solr needs, things slow down as more and more cycles are spent on GC; if
>> you reduce it a little further you’ll get OOMs.
>> 
>> You can take heap dumps of course to see where all the memory is being
>> used, but that’s tricky as it also includes garbage.
>> 
>> I’ve seen cache sizes (filterCache in particular) be something that uses
>> lots of memory, but that requires queries to be fired. Each filterCache
>> entry can take up to roughly maxDoc/8 bytes + overhead….
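Here is the back-of-envelope math on that maxDoc/8 figure (an illustrative sketch; the `FilterCacheSize` class and the 512-entry example are mine, not from Solr — a filter entry is, at worst, a bitset with one bit per document):

```java
public class FilterCacheSize {
    // Rough upper bound for one filterCache entry: one bit per document
    // in the index, i.e. maxDoc / 8 bytes, plus per-entry overhead.
    public static long entryBytes(long maxDoc) {
        return maxDoc / 8;
    }

    public static void main(String[] args) {
        // A 100M-document index: each cached filter can cost ~12.5 MB,
        // so a filterCache sized at 512 entries could reach ~6.4 GB.
        System.out.println(entryBytes(100_000_000L));
        System.out.println(entryBytes(100_000_000L) * 512);
    }
}
```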
>> 
>> A classic error is to sort, group or facet on a docValues=false field.
>> Starting with Solr 7.6, you can add an option to fields to throw an error
>> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.
>> 
>> In short, there’s not enough information until you dive in and test
>> bunches of stuff to tell.
>> 
>> Best,
>> Erick
>> 
>> 
>>> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com>
>> wrote:
>>> 
>>> This makes sense. Any ideas why lucene/solr will use 10g heap for a 20g
>>> index? My hypothesis was merging segments was trying to read it all, but if
>>> that's not the case I am out of ideas. The one caveat is we are trying to
>>> add the documents quickly (~1g an hour), but if lucene does write 100m
>>> segments and does streaming merge it shouldn't matter?
>>> 
>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org>
>>> wrote:
>>> 
>>>>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> 2. Merging segments - does solr load the entire segment in memory or
>>>>> chunks of it? if the latter, how large are these chunks?
>>>> 
>>>> No, it does not read the entire segment into memory.
>>>> 
>>>> A fundamental part of the Lucene design is streaming posting lists into
>>>> memory and processing them sequentially. The same amount of memory is
>>>> needed for small or large segments. Each posting list is in document-id
>>>> order. The merge is a merge of sorted lists, writing a new posting list
>> in
>>>> document-id order.
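To illustrate the streaming merge described above (an editorial sketch, not Lucene’s actual code — real Lucene also remaps document ids and drops deleted docs): merging posting lists that are each sorted by doc id is a two-pointer merge, so memory use stays constant regardless of segment size:

```java
import java.util.ArrayList;
import java.util.List;

public class PostingMerge {
    // Merge two posting lists, each sorted by document id, consuming one
    // entry at a time from each -- memory does not grow with list length.
    public static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i) <= b.get(j)) out.add(a.get(i++));
            else out.add(b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        // Output is in ascending doc-id order: [1, 2, 3, 4, 7, 9]
        System.out.println(merge(List.of(1, 4, 7), List.of(2, 3, 9)));
    }
}
```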
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>> 
>> 
>> 
