I've looked into the stack trace. I see that one thread triggers a commit
via an update command, and it's blocked on searcher warming. The really odd
thing is that the thread state is BLOCKED. Can you check that there is
spare heap space available during the indexing peak, and also that there is
free RAM left after heap allocation? Can it happen that the warming queries
are unnecessarily heavy?
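
Heavy warming usually comes from large autowarmCount values on the caches
or from expensive newSearcher/firstSearcher queries. Just a sketch of the
places to look at in solrconfig.xml (the values below are placeholders, not
taken from your config):

  <query>
    <!-- a large autowarmCount replays many cached entries on every new searcher -->
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

    <!-- queries listed here run on every commit that opens a searcher; keep them cheap -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries"/>
    </listener>
  </query>
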
Also, explicit commits might cause issues; consider the usual best practice
of relying on auto-commit with openSearcher=false, plus soft commits when
visibility is needed.
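
A minimal sketch of what I mean, inside <updateHandler> in solrconfig.xml
(the intervals are placeholders, tune them to your ingestion rate):

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit for durability -->
    <openSearcher>false</openSearcher>  <!-- no searcher warming on hard commits -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>           <!-- visibility: this is what opens a new searcher -->
  </autoSoftCommit>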


On Mon, Jul 24, 2017 at 4:35 PM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Alright, after adding a field and a full cluster restart, the cluster is
> going nuts once again, and this time almost immediately after the restart.
>
> I have now restarted all but one so there is some room to compare, or so I
> thought. Now the node I didn't restart has also dropped in CPU usage. This seems to
> correspond to another incident some time ago where all nodes went crazy
> over an extended period, but calmed down after a few were restarted. So it
> could be a problem of inter-node communication.
>
> The index is in one segment at this moment but some documents are being
> indexed. Some queries are executed, but not very many. Attaching the stack
> anyway.
>
>
>
>
>
> -----Original message-----
> > From:Mikhail Khludnev <m...@apache.org>
> > Sent: Wednesday 19th July 2017 14:41
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> >
> > You can get a stack dump from kill -3 or jstack, or even from the Solr
> > admin UI. Overall, this behavior looks like a typical heavy merge
> > kicking off from time to time.
> >
> > On Wed, Jul 19, 2017 at 3:31 PM, Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> > > Hello,
> > >
> > > No, I cannot expose the stack; the VisualVM samples won't show it to me.
> > >
> > > I am not sure if they're about to sync all the time, but every 15
> > > minutes some documents are indexed (3-4k). For some reason, index time
> > > does increase with latency / CPU usage.
> > >
> > > This situation runs fine for many hours, then it will slowly start to
> > > go bad, until nodes are restarted (or index size decreased).
> > >
> > > Thanks,
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Mikhail Khludnev <m...@apache.org>
> > > > Sent: Wednesday 19th July 2017 14:18
> > > > To: solr-user <solr-user@lucene.apache.org>
> > > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > > >
> > > > >
> > > > > The real distinction between busy and calm nodes is that busy
> > > > > nodes all have o.a.l.codecs.perfield.PerFieldPostingsFormat$
> > > > > FieldsReader.terms() as second to fillBuffer(), what are they doing?
> > > >
> > > >
> > > > Can you expose the stack deeper?
> > > > Could they be starting to sync shards for some reason?
> > > >
> > > > On Wed, Jul 19, 2017 at 12:35 PM, Markus Jelsma <markus.jel...@openindex.io>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Another peculiarity here, our six node (2 shards / 3 replicas)
> > > > > cluster is going crazy after a good part of the day has passed. It
> > > > > starts eating CPU for no good reason and its latency goes up.
> > > > > Grafana graphs show the problem really well.
> > > > >
> > > > > After restarting 2/6 nodes, there is also quite a distinction in
> > > > > the VisualVM monitor views, and the VisualVM CPU sampler reports
> > > > > (sorted on self time (CPU)). The busy nodes are deeply red in
> > > > > o.a.h.impl.io.AbstractSessionInputBuffer.fillBuffer (as usual), the
> > > > > restarted nodes are not.
> > > > >
> > > > > The real distinction between busy and calm nodes is that busy
> > > > > nodes all have o.a.l.codecs.perfield.PerFieldPostingsFormat$
> > > > > FieldsReader.terms() as second to fillBuffer(), what are they
> > > > > doing?! Why? The calm nodes don't show this at all. Busy nodes all
> > > > > have o.a.l.codec stuff on top, restarted nodes don't.
> > > > >
> > > > > So, actually, I don't have a clue! Any, any ideas?
> > > > >
> > > > > Thanks,
> > > > > Markus
> > > > >
> > > > > Each replica is underpowered but performing really well after
> > > > > restart (and JVM warmup): 4 CPUs, 900M heap, 8 GB RAM, maxDoc 2.8
> > > > > million, index size 18 GB.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Sincerely yours
> > > > Mikhail Khludnev
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>



-- 
Sincerely yours
Mikhail Khludnev
