Just a couple of points I’d make here. In some testing I did a while back,
if no commit is made (hard or soft), there are internal memory structures
holding tlog data, and the problem keeps getting worse the more docs come
in. I don’t know if that’s changed in later versions. I’d recommend doing
commits with some regularity in indexing-heavy apps (rough config sketch
below), otherwise you are likely to run into heap issues.

I’d also second some of the points already made. There are too many
variables in play, and too many ways to tune things, for sizing decisions
to be anything other than a pure guess if you don’t test and monitor. I’d
advocate for a process of regular testing to settle questions like number
of shards/replicas, heap size, memory, etc. Hard data, a good process and
regular testing will trump guesswork every time.
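
As a purely illustrative sketch (the interval is a placeholder to be
validated by your own testing, not a recommendation), a periodic hard
commit along the lines Erick describes below is configured in
solrconfig.xml roughly like this:

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- hard commit: flushes segments and rolls over the tlog without
         opening a new searcher, so it stays relatively cheap -->
    <autoCommit>
      <maxTime>60000</maxTime>            <!-- e.g. every 60 seconds -->
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>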

Greg

On Tue, Jun 4, 2019 at 9:22 AM John Davis <johndavis925...@gmail.com> wrote:

> You might want to test with a softcommit of hours vs 5m for heavy indexing +
> light query -- even though there is internal memory structure overhead when
> you never soft commit, in our testing a 5m soft commit (via commitWithin)
> resulted in very, very large heap usage, which I suspect is because of
> other overhead associated with it.
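>
> (To make the mechanism concrete -- a rough sketch only, with made-up field
> names: commitWithin is requested per update, e.g. as an attribute on an XML
> add, and by default it is satisfied with a soft commit. The 300000 ms below
> is just the 5 minutes mentioned above.)
>
>   <add commitWithin="300000">
>     <doc>
>       <field name="id">doc-1</field>
>       <field name="title">example title</field>
>     </doc>
>   </add>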
>
> On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > I need to update that, didn’t understand the bits about retaining internal
> > memory structures at the time.
> >
> > > On Jun 4, 2019, at 2:10 AM, John Davis <johndavis925...@gmail.com>
> > wrote:
> > >
> > > Erick - These conflict, what's changed?
> > >
> > > So if I were going to recommend settings, they’d be something like this:
> > > Do a hard commit with openSearcher=false every 60 seconds.
> > > Do a soft commit every 5 minutes.
> > >
> > > vs
> > >
> > > Index-heavy, Query-light
> > > Set your soft commit interval quite long, up to the maximum latency you can
> > > stand for documents to be visible. This could be just a couple of minutes
> > > or much longer. Maybe even hours with the capability of issuing a hard
> > > commit (openSearcher=true) or soft commit on demand.
> > >
> > >
> > > https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > >
> > >
> > >
> > >
> > > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <erickerick...@gmail.com
> >
> > > wrote:
> > >
> > >>> I've looked through SolrJ, DIH and others -- is the bottomline
> > >>> across all of them to "batch updates" and not commit as long as
> > possible?
> > >>
> > >> Of course it’s more complicated than that ;)….
> > >>
> > >> But to start, yes, I urge you to batch. Here’s some stats:
> > >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> > >>
> > >> Note that at about 100 docs/batch you hit diminishing returns. _However_,
> > >> that test was run on a single shard collection, so if you have 10 shards
> > >> you’d have to send 1,000 docs/batch. I wouldn’t sweat that number much,
> > >> just don’t send one at a time. And there are the usual gotchas if your
> > >> documents are 1M .vs. 1K.
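> > >>
> > >> A rough sketch of what a batched request looks like in the XML update
> > >> format (hypothetical field names; the only point is many <doc> elements
> > >> per request instead of one doc per request):
> > >>
> > >>   <add>
> > >>     <doc><field name="id">1</field><field name="title">first doc</field></doc>
> > >>     <doc><field name="id">2</field><field name="title">second doc</field></doc>
> > >>     <!-- ... on the order of 100 docs per request ... -->
> > >>   </add>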
> > >>
> > >> About committing. No, don’t hold off as long as possible. When you commit,
> > >> segments are merged. _However_, the default 100M internal buffer size means
> > >> that segments are written anyway even if you don’t hit a commit point when
> > >> you have 100M of index data, and merges happen anyway. So you won’t save
> > >> anything on merging by holding off commits.
> > >> And you’ll incur penalties. Here’s more than you want to know about
> > >> commits:
> > >>
> > >> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > >>
> > >> But some key take-aways… If for some reason Solr abnormally
> > >> terminates, the accumulated documents since the last hard
> > >> commit are replayed. So say you don’t commit for an hour of
> > >> furious indexing and someone does a “kill -9”. When you restart
> > >> Solr it’ll try to re-index all the docs for the last hour. Hard commits
> > >> with openSearcher=false aren’t all that expensive. I usually set mine
> > >> for a minute and forget about it.
> > >>
> > >> Transaction logs hold a window, _not_ the entire set of operations
> > >> since time began. When you do a hard commit, the current tlog is
> > >> closed and a new one opened and ones that are “too old” are deleted. If
> > >> you never commit you have a huge transaction log to no good purpose.
> > >>
> > >> Also, while indexing, in order to accommodate “Real Time Get”, all
> > >> the docs indexed since the last searcher was opened have a pointer
> > >> kept in memory. So if you _never_ open a new searcher, that internal
> > >> structure can get quite large. So in bulk-indexing operations, I
> > >> suggest you open a searcher every so often.
> > >>
> > >> Opening a new searcher isn’t terribly expensive if you have no autowarming
> > >> going on. Autowarming is what’s defined in solrconfig.xml on the
> > >> filterCache, queryResultCache, etc.
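> > >>
> > >> For reference, those caches and their autowarming look roughly like this
> > >> in solrconfig.xml (sizes are just the stock examples, not recommendations;
> > >> autowarmCount="0" is what “no autowarming” means here):
> > >>
> > >>   <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
> > >>   <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>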
> > >>
> > >> So if I were going to recommend settings, they’d be something like this:
> > >> Do a hard commit with openSearcher=false every 60 seconds.
> > >> Do a soft commit every 5 minutes.
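> > >>
> > >> In solrconfig.xml terms that would be roughly (a sketch only, adjust and
> > >> test for your own load):
> > >>
> > >>   <autoCommit>
> > >>     <maxTime>60000</maxTime>
> > >>     <openSearcher>false</openSearcher>
> > >>   </autoCommit>
> > >>   <autoSoftCommit>
> > >>     <maxTime>300000</maxTime>
> > >>   </autoSoftCommit>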
> > >>
> > >> I’d actually be surprised if you were able to measure differences between
> > >> those settings and just hard commit with openSearcher=true every 60
> > >> seconds and soft commit at -1 (never)…
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >>> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com>
> > >> wrote:
> > >>>
> > >>> If we assume there is no query load then effectively this boils down to
> > >>> the most effective way for adding a large number of documents to the solr
> > >>> index. I've looked through SolrJ, DIH and others -- is the bottomline
> > >>> across all of them to "batch updates" and not commit as long as
> > >>> possible?
> > >>>
> > >>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <
> erickerick...@gmail.com
> > >
> > >>> wrote:
> > >>>
> > >>>> Oh, there are about a zillion reasons ;).
> > >>>>
> > >>>> First of all, most tools that show heap usage also count uncollected
> > >>>> garbage. So your 10G could actually be much less “live” data. A quick way
> > >>>> to test is to attach jconsole to the running Solr and hit the button that
> > >>>> forces a full GC.
> > >>>>
> > >>>> Another way is to reduce your heap when you start Solr (on a test system
> > >>>> of course) until bad stuff happens. If you reduce it to very close to what
> > >>>> Solr needs, you’ll get slower as more and more cycles are spent on GC; if
> > >>>> you reduce it a little more you’ll get OOMs.
> > >>>>
> > >>>> You can take heap dumps of course to see where all the memory is being
> > >>>> used, but that’s tricky as it also includes garbage.
> > >>>>
> > >>>> I’ve seen cache sizes (filterCache in particular) be something that uses
> > >>>> lots of memory, but that requires queries to be fired. Each filterCache
> > >>>> entry can take up to roughly maxDoc/8 bytes + overhead….
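> > >>>> (For a sense of scale: with maxDoc at, say, 100M docs, that’s roughly
> > >>>> 12.5 MB per filterCache entry, so a few hundred cached entries can pin
> > >>>> several GB of heap on their own. Those numbers are only an illustration.)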
> > >>>>
> > >>>> A classic error is to sort, group or facet on a docValues=false field.
> > >>>> Starting with Solr 7.6, you can add an option to fields to throw an error
> > >>>> if you do this, see:
> > >>>> https://issues.apache.org/jira/browse/SOLR-12962
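> > >>>>
> > >>>> If memory serves, the option that JIRA added is “uninvertible”; a schema
> > >>>> sketch (hypothetical field, double-check the issue before relying on it):
> > >>>>
> > >>>>   <!-- with docValues="false" and uninvertible="false", sorting/faceting
> > >>>>        on this field fails fast instead of silently uninverting on-heap -->
> > >>>>   <field name="category_s" type="string" indexed="true" stored="true"
> > >>>>          docValues="false" uninvertible="false"/>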
> > >>>>
> > >>>> In short, there’s not enough information until you dive in and test
> > >>>> bunches of stuff to tell.
> > >>>>
> > >>>> Best,
> > >>>> Erick
> > >>>>
> > >>>>
> > >>>>> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com>
> > >>>> wrote:
> > >>>>>
> > >>>>> This makes sense. Any ideas why lucene/solr will use 10g heap for a 20g
> > >>>>> index? My hypothesis was that merging segments was trying to read it all,
> > >>>>> but if that's not the case I am out of ideas. The one caveat is we are
> > >>>>> trying to add the documents quickly (~1g an hour), but if lucene does
> > >>>>> write 100m segments and does streaming merges it shouldn't matter?
> > >>>>>
> > >>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <
> > wun...@wunderwood.org
> > >>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>>> On May 31, 2019, at 11:27 PM, John Davis <
> > johndavis925...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> 2. Merging segments - does solr load the entire segment in memory or
> > >>>>>>> chunks of it? If the latter, how large are these chunks?
> > >>>>>>
> > >>>>>> No, it does not read the entire segment into memory.
> > >>>>>>
> > >>>>>> A fundamental part of the Lucene design is streaming posting lists into
> > >>>>>> memory and processing them sequentially. The same amount of memory is
> > >>>>>> needed for small or large segments. Each posting list is in document-id
> > >>>>>> order. The merge is a merge of sorted lists, writing a new posting list
> > >>>>>> in document-id order.
> > >>>>>>
> > >>>>>> wunder
> > >>>>>> Walter Underwood
> > >>>>>> wun...@wunderwood.org
> > >>>>>> http://observer.wunderwood.org/  (my blog)
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
>
