Can you reproduce with 4G heap?

On Wed, 26 Jul 2017 at 23:11, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hello Mikhail,
>
> Spot on, there is indeed not enough heap when our nodes are in this crazy state. When the nodes are happy, the average heap consumption is 50 to 60 percent; at peak, when indexing, there is in general enough heap for the warming to run smoothly.
>
> I probably forgot to mention that high CPU in our case also means high heap consumption when the nodes act mad. The stack trace you saw was from a crazy node while documents were being indexed, so the blocking you mention makes sense.
>
> I still believe this is not a heap issue but something odd in Solr that, in (INSERT SITUATION), eats unreasonable amounts of heap. Remember, the bad node went back to a normal 50-60 percent heap consumption and normal CPU time after _other_ nodes got restarted. All nodes were bad and went back to normal after a restart. Bad nodes that were not restarted then suddenly revived and also went back to normal. This is very odd behavior.
>
> Observing this, I am inclined to think that Solr's inter-node communication can get into a weird state: eating heap, eating CPU, going bad. Using CPU or heap sampling it is very hard, for me at least, to quickly spot something that is bad, so I am clueless. What do you think?
>
> How can a bad node become normal just by restarting another bad node? Puzzling...
>
> Thanks,
> Markus
>
> -----Original message-----
> > From: Mikhail Khludnev <m...@apache.org>
> > Sent: Wednesday 26th July 2017 10:50
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> >
> > I've looked into the stack trace. I see that one thread triggers a commit via an update command, and it is blocked on searcher warming. The really odd thing is total state = BLOCKED. Can you check that there is spare heap space available during the indexing peak? And also that there is free RAM (after heap allocation)? Can it happen that a warming query is unnecessarily heavy? Also, explicit commits might cause issues; consider the best practice of auto-commit with openSearcher=false and a soft commit when necessary [see the solrconfig.xml sketch after this thread].
> >
> > On Mon, Jul 24, 2017 at 4:35 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > > Alright, after adding a field and a full cluster restart, the cluster is going nuts once again, and this time almost immediately after the restart.
> > >
> > > I have now restarted all but one node, so there is some room to compare, or so I thought. Now the node I didn't restart also drops in CPU usage. This seems to correspond to another incident some time ago where all nodes went crazy over an extended period, but calmed down after a few were restarted. So it could be a problem of inter-node communication.
> > >
> > > The index is in one segment at this moment, but some documents are being indexed. Some queries are executed, but not very many. Attaching the stack anyway.
> > >
> > > -----Original message-----
> > > > From: Mikhail Khludnev <m...@apache.org>
> > > > Sent: Wednesday 19th July 2017 14:41
> > > > To: solr-user <solr-user@lucene.apache.org>
> > > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > > >
> > > > You can get a stack dump from kill -3 or jstack, or even from the Solr admin UI [see the shell sketch after this thread]. Overall, this behavior looks like a typical heavy merge kicking off from time to time.
> > > > On Wed, Jul 19, 2017 at 3:31 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > No, I cannot expose the stack; VisualVM samples won't show it to me.
> > > > >
> > > > > I am not sure if they're about to sync all the time, but every 15 minutes some documents are indexed (3 - 4k). For some reason, index time does increase with latency / CPU usage.
> > > > >
> > > > > This situation runs fine for many hours, then it will slowly start to go bad, until nodes are restarted (or the index size is decreased).
> > > > >
> > > > > Thanks,
> > > > > Markus
> > > > >
> > > > > -----Original message-----
> > > > > > From: Mikhail Khludnev <m...@apache.org>
> > > > > > Sent: Wednesday 19th July 2017 14:18
> > > > > > To: solr-user <solr-user@lucene.apache.org>
> > > > > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > > > > >
> > > > > > > The real distinction between busy and calm nodes is that busy nodes all
> > > > > > > have o.a.l.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms() as
> > > > > > > second to fillBuffer(), what are they doing?
> > > > > >
> > > > > > Can you expose the stack deeper?
> > > > > > Can they start to sync shards due to some reason?
> > > > > >
> > > > > > On Wed, Jul 19, 2017 at 12:35 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Another peculiarity here: our six-node (2 shards / 3 replicas) cluster is going crazy after a good part of the day has passed. It starts eating CPU for no good reason and its latency goes up. The Grafana graphs show the problem really well.
> > > > > > >
> > > > > > > After restarting 2/6 nodes, there is also quite a distinction in the VisualVM monitor views and the VisualVM CPU sampler reports (sorted on self time (CPU)). The busy nodes are deeply red in o.a.h.impl.io.AbstractSessionInputBuffer.fillBuffer (as usual); the restarted nodes are not.
> > > > > > >
> > > > > > > The real distinction between busy and calm nodes is that busy nodes all have o.a.l.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms() as second to fillBuffer(). What are they doing?! Why? The calm nodes don't show this at all. Busy nodes all have o.a.l.codec stuff on top; restarted nodes don't.
> > > > > > >
> > > > > > > So, actually, I don't have a clue! Any, any ideas?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Markus
> > > > > > >
> > > > > > > Each replica is underpowered but performing really well after a restart (and JVM warmup): 4 CPUs, 900M heap, 8 GB RAM, maxDoc 2.8 million, index size 18 GB.
> > > > > >
> > > > > > --
> > > > > > Sincerely yours
> > > > > > Mikhail Khludnev
> > > >
> > > > --
> > > > Sincerely yours
> > > > Mikhail Khludnev
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
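For reference, Mikhail's suggestion in the thread (rely on auto-commit with openSearcher=false plus soft commits instead of explicit commits) roughly corresponds to an updateHandler block like the sketch below. This is a minimal sketch only; the interval values are illustrative assumptions, not taken from this thread, and should be tuned to your own indexing cadence.

  <!-- solrconfig.xml sketch: hard commits flush segments but do not open a new
       searcher (so no warming on every hard commit); the soft commit controls
       when newly indexed documents become visible. Intervals are illustrative. -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>60000</maxTime>              <!-- hard commit at most once a minute -->
      <openSearcher>false</openSearcher>    <!-- no searcher warming on hard commits -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>900000</maxTime>             <!-- soft commit every 15 min, roughly the indexing cadence mentioned in the thread -->
    </autoSoftCommit>
  </updateHandler>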
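Likewise, the thread-dump and heap suggestions (kill -3 / jstack, and the question about a 4G heap) can be tried with something like the commands below. This is a sketch under assumptions: it assumes a stock bin/solr install whose JVM is launched via Jetty's start.jar and configured through solr.in.sh; adjust to how your nodes are actually started. The admin UI's Thread Dump screen is an alternative that needs no shell access.

  # Find the Solr JVM's pid (assumes the node was started via Jetty's start.jar)
  SOLR_PID=$(pgrep -f start.jar | head -n 1)

  # Take a thread dump with jstack, written to a timestamped file ...
  jstack -l "$SOLR_PID" > /tmp/solr-threads-$(date +%s).txt

  # ... or with kill -3 (SIGQUIT), which prints the dump to the JVM's console log
  kill -3 "$SOLR_PID"

  # To retry with a larger heap (the 4G question above), e.g. in solr.in.sh:
  #   SOLR_HEAP="4g"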