I need to update that; I didn’t understand the bits about retaining internal 
memory structures at the time.

> On Jun 4, 2019, at 2:10 AM, John Davis <johndavis925...@gmail.com> wrote:
> 
> Erick - These conflict, what's changed?
> 
> So if I were going to recommend settings, they’d be something like this:
> Do a hard commit with openSearcher=false every 60 seconds.
> Do a soft commit every 5 minutes.
> 
> vs
> 
> Index-heavy, Query-light
> Set your soft commit interval quite long, up to the maximum latency you can
> stand for documents to be visible. This could be just a couple of minutes
> or much longer. Maybe even hours with the capability of issuing a hard
> commit (openSearcher=true) or soft commit on demand.
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> 
> 
> 
> 
> On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>>> I've looked through SolrJ, DIH and others -- is the bottomline
>>> across all of them to "batch updates" and not commit as long as possible?
>> 
>> Of course it’s more complicated than that ;)….
>> 
>> But to start, yes, I urge you to batch. Here’s some stats:
>> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
>> 
>> Note that at about 100 docs/batch you hit diminishing returns. _However_,
>> that test was run on a single-shard collection, so if you have 10 shards
>> you’d have to send 1,000 docs/batch. I wouldn’t sweat that number much,
>> just don’t send one document at a time. And there are the usual gotchas
>> if your documents are 1M vs. 1K in size.
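>> 
>> For illustration only (the field names here are made up), batching just
>> means putting many documents in one update request instead of one request
>> per document. The XML form of a batched add looks roughly like this; with
>> SolrJ the equivalent is a single client.add(...) call given a collection
>> of SolrInputDocuments:
>> 
>>   <add>
>>     <doc>
>>       <field name="id">1</field>
>>       <field name="title">first doc</field>
>>     </doc>
>>     <doc>
>>       <field name="id">2</field>
>>       <field name="title">second doc</field>
>>     </doc>
>>     <!-- ... and so on, roughly 100 docs per shard per request ... -->
>>   </add>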
>> 
>> About committing: no, don’t hold off as long as possible. When you commit,
>> segments are merged. _However_, the default 100M internal buffer means that
>> once you’ve accumulated 100M of index data, segments are written out anyway
>> even if you never hit a commit point, and merges happen anyway. So you won’t
>> save anything on merging by holding off commits.
>> And you’ll incur penalties. Here’s more than you want to know about
>> commits:
>> 
>> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
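>> 
>> (The 100M buffer mentioned above is the ramBufferSizeMB setting in
>> solrconfig.xml; 100 is the default, i.e. <ramBufferSizeMB>100</ramBufferSizeMB>.)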
>> 
>> But some key take-aways… If for some reason Solr abnormally
>> terminates, the accumulated documents since the last hard
>> commit are replayed. So say you don’t commit for an hour of
>> furious indexing and someone does a “kill -9”. When you restart
>> Solr it’ll try to re-index all the docs for the last hour. Hard commits
>> with openSearcher=false aren’t all that expensive. I usually set mine
>> for a minute and forget about it.
>> 
>> Transaction logs hold a window, _not_ the entire set of operations
>> since time began. When you do a hard commit, the current tlog is
>> closed, a new one is opened, and ones that are “too old” are deleted. If
>> you never commit, you keep a huge transaction log around to no good purpose.
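>> 
>> That window is governed by the updateLog settings in solrconfig.xml; the
>> values below are the usual defaults, shown only for illustration:
>> 
>>   <updateLog>
>>     <str name="dir">${solr.ulog.dir:}</str>
>>     <int name="numRecordsToKeep">100</int>
>>     <int name="maxNumLogsToKeep">10</int>
>>   </updateLog>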
>> 
>> Also, while indexing, in order to accommodate “Real Time Get”, all
>> the docs indexed since the last searcher was opened have a pointer
>> kept in memory. So if you _never_ open a new searcher, that internal
>> structure can get quite large. So in bulk-indexing operations, I
>> suggest you open a searcher every so often.
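>> 
>> (Opening a searcher on demand is just a soft commit, e.g. an update request
>> with the softCommit=true parameter, or any commit that opens a new searcher.)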
>> 
>> Opening a new searcher isn’t terribly expensive if you have no autowarming
>> going on. Autowarming is defined in solrconfig.xml on the filterCache,
>> queryResultCache, etc.
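>> 
>> For illustration, a cache definition in solrconfig.xml looks roughly like
>> this (class and sizes are just placeholders for whatever you use);
>> autowarmCount="0" means no autowarming for that cache:
>> 
>>   <filterCache class="solr.FastLRUCache"
>>                size="512"
>>                initialSize="512"
>>                autowarmCount="0"/>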
>> 
>> So if I were going to recommend settings, they’d be something like this:
>> Do a hard commit with openSearcher=false every 60 seconds.
>> Do a soft commit every 5 minutes.
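>> 
>> In solrconfig.xml terms that would be something like this (maxTime is in
>> milliseconds):
>> 
>>   <autoCommit>
>>     <maxTime>60000</maxTime>            <!-- hard commit every 60 seconds -->
>>     <openSearcher>false</openSearcher>
>>   </autoCommit>
>> 
>>   <autoSoftCommit>
>>     <maxTime>300000</maxTime>           <!-- soft commit every 5 minutes -->
>>   </autoSoftCommit>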
>> 
>> I’d actually be surprised if you were able to measure differences between
>> those settings and just hard commit with openSearcher=true every 60
>> seconds and soft commit at -1 (never)…
>> 
>> Best,
>> Erick
>> 
>>> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com>
>>> wrote:
>>> 
>>> If we assume there is no query load then effectively this boils down to
>>> most effective way for adding a large number of documents to the solr
>>> index. I've looked through SolrJ, DIH and others -- is the bottomline
>>> across all of them to "batch updates" and not commit as long as possible?
>>> 
>>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>> 
>>>> Oh, there are about a zillion reasons ;).
>>>> 
>>>> First of all, most tools that show heap usage also count uncollected
>>>> garbage, so your 10G could actually be much less “live” data. A quick way
>>>> to test is to attach jconsole to the running Solr and hit the button that
>>>> forces a full GC.
>>>> 
>>>> Another way is to reduce your heap when you start Solr (on a test system
>>>> of course) until bad stuff happens. If you reduce it to very close to what
>>>> Solr needs, you’ll get slower as more and more cycles are spent on GC; if
>>>> you reduce it a little more, you’ll get OOMs.
>>>> 
>>>> You can take heap dumps of course to see where all the memory is being
>>>> used, but that’s tricky as it also includes garbage.
>>>> 
>>>> I’ve seen cache sizes (filterCache in particular) be something that uses
>>>> lots of memory, but that requires queries to be fired. Each filterCache
>>>> entry can take up to roughly maxDoc/8 bytes + overhead….
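>>>> 
>>>> As a purely made-up example: with maxDoc = 100M, a single filterCache
>>>> entry can be about 100,000,000 / 8 = 12.5MB, so a filterCache with
>>>> size=512 can grow to over 6GB by itself once it fills up.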
>>>> 
>>>> A classic error is to sort, group or facet on a docValues=false field.
>>>> Starting with Solr 7.6, you can add an option to fields to throw an error
>>>> if you do this; see: https://issues.apache.org/jira/browse/SOLR-12962.
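>>>> 
>>>> That option is the uninvertible flag on a field or fieldType. A sketch,
>>>> with a made-up field name:
>>>> 
>>>>   <field name="category" type="string" indexed="true" stored="true"
>>>>          docValues="false" uninvertible="false"/>
>>>> 
>>>> With uninvertible="false", requests that would need to un-invert the field
>>>> (sorting, grouping, faceting) are rejected with an error instead of
>>>> silently building a large FieldCache structure on the heap.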
>>>> 
>>>> In short, there’s not enough information until you dive in and test
>>>> bunches of stuff to tell.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> 
>>>>> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> This makes sense. Any ideas why Lucene/Solr will use a 10G heap for a 20G
>>>>> index? My hypothesis was that merging segments was trying to read it all,
>>>>> but if that's not the case I am out of ideas. The one caveat is we are
>>>>> trying to add documents quickly (~1G an hour), but if Lucene does write
>>>>> 100M segments and does a streaming merge, it shouldn't matter?
>>>>> 
>>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org>
>>>>> wrote:
>>>>> 
>>>>>>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> 2. Merging segments - does Solr load the entire segment in memory or
>>>>>>> chunks of it? If the latter, how large are these chunks?
>>>>>> 
>>>>>> No, it does not read the entire segment into memory.
>>>>>> 
>>>>>> A fundamental part of the Lucene design is streaming posting lists into
>>>>>> memory and processing them sequentially. The same amount of memory is
>>>>>> needed for small or large segments. Each posting list is in document-id
>>>>>> order. The merge is a merge of sorted lists, writing a new posting list
>>>>>> in document-id order.
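>>>>>> 
>>>>>> As a toy illustration (ignoring the doc-id remapping that happens during
>>>>>> a merge): merging the postings for one term from two segments, say doc
>>>>>> ids [0, 3, 7] and [1, 4, 9], just streams out the merged sorted list
>>>>>> [0, 1, 3, 4, 7, 9], so the memory needed is independent of segment size.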
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
