Hello Mikhail,

Spot on: there is indeed not enough heap when our nodes are in this crazy state. When the nodes are happy, the average heap consumption is 50 to 60 percent; at peak, while indexing, there is in general enough heap for the warming to run smoothly.
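For reference, this is roughly how heap occupancy can be watched on a node from the command line with plain JDK tooling (just a sketch; <solr-pid> is a placeholder for the Solr process id and the 5 second interval is arbitrary):

  # <solr-pid>: placeholder for the Solr JVM process id
  # heap occupancy per generation plus GC counters, sampled every 5 seconds
  jstat -gcutil <solr-pid> 5000

  # histogram of the biggest heap consumers (note: :live forces a full GC first)
  jmap -histo:live <solr-pid> | head -n 30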
I probably forgot to mention that, in our case, high CPU also means high heap consumption when the nodes act up. The stack trace you saw was from a misbehaving node while documents were being indexed, so the blocking you mention makes sense. I still believe this is not a heap issue but something odd in Solr that, in (INSERT SITUATION), eats unreasonable amounts of heap. Remember, the bad node went back to a normal 50-60 percent heap consumption and normal CPU time after _other_ nodes got restarted. All nodes were bad and went back to normal after a restart, and bad nodes that were not restarted then suddenly revived and also went back to normal. This is very odd behavior.

Observing this, I am inclined to think that Solr's inter-node communication can get into a weird state: eating heap, eating CPU, going bad. Using CPU or heap sampling it is very hard, for me at least, to quickly spot what is wrong, so I am clueless. What do you think? How can a bad node become normal just by restarting another bad node? Puzzling...
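Regarding your point about explicit commits: if I understand the suggested best practice correctly, it would amount to something like the following in solrconfig.xml (a rough sketch only; the maxTime values are invented for illustration and are not our actual settings):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- hard commit for durability; example interval only -->
      <maxTime>60000</maxTime>
      <!-- do not open a new searcher on the hard commit -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- soft commit controls visibility; example interval only -->
      <maxTime>900000</maxTime>
    </autoSoftCommit>
  </updateHandler>

Does that match what you had in mind?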
Thanks,
Markus

-----Original message-----
> From:Mikhail Khludnev <m...@apache.org>
> Sent: Wednesday 26th July 2017 10:50
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
>
> I've looked into the stack trace. I see that one thread triggers a commit via an update command, and it's blocked on searcher warming. The really odd thing is total state = BLOCKED. Can you check that there is spare heap space available during the indexing peak? And also that there is free RAM (after heap allocation)? Can it happen that the warming query is unnecessarily heavy? Also, explicit commits might cause issues; consider the best practice of auto-commit with openSearcher=false, and soft commit when necessary.
>
> On Mon, Jul 24, 2017 at 4:35 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > Alright, after adding a field and a full cluster restart, the cluster is going nuts once again, and this time almost immediately after the restart.
> >
> > I have now restarted all but one, so there is some room to compare, or so I thought. Now, the node I didn't restart also drops CPU usage. This seems to correspond to another incident some time ago where all nodes went crazy over an extended period, but calmed down after a few were restarted. So it could be a problem of inter-node communication.
> >
> > The index is in one segment at this moment, but some documents are being indexed. Some queries are executed, but not very many. Attaching the stack anyway.
> >
> > -----Original message-----
> > > From:Mikhail Khludnev <m...@apache.org>
> > > Sent: Wednesday 19th July 2017 14:41
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > >
> > > You can get a stack from kill -3 / jstack, or even from the Solr admin UI. Overall, this behavior looks like a typical heavy merge kicking off from time to time.
> > >
> > > On Wed, Jul 19, 2017 at 3:31 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > >
> > > > Hello,
> > > >
> > > > No, I cannot expose the stack; VisualVM samples won't show it to me.
> > > >
> > > > I am not sure if they're about to sync all the time, but every 15 minutes some documents are indexed (3-4k). For some reason, index time does increase with latency / CPU usage.
> > > >
> > > > This situation runs fine for many hours, then it will slowly start to go bad, until nodes are restarted (or index size decreased).
> > > >
> > > > Thanks,
> > > > Markus
> > > >
> > > > -----Original message-----
> > > > > From:Mikhail Khludnev <m...@apache.org>
> > > > > Sent: Wednesday 19th July 2017 14:18
> > > > > To: solr-user <solr-user@lucene.apache.org>
> > > > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > > > >
> > > > > > The real distinction between busy and calm nodes is that busy nodes all have o.a.l.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms() as second to fillBuffer(), what are they doing?
> > > > >
> > > > > Can you expose the stack deeper?
> > > > > Can they start to sync shards due to some reason?
> > > > >
> > > > > On Wed, Jul 19, 2017 at 12:35 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Another peculiarity here: our six-node (2 shards / 3 replicas) cluster is going crazy after a good part of the day has passed. It starts eating CPU for no good reason and its latency goes up. Grafana graphs show the problem really well.
> > > > > >
> > > > > > After restarting 2/6 nodes, there is also quite a distinction in the VisualVM monitor views and the VisualVM CPU sampler reports (sorted on self time (CPU)). The busy nodes are deeply red in o.a.h.impl.io.AbstractSessionInputBuffer.fillBuffer (as usual), the restarted nodes are not.
> > > > > >
> > > > > > The real distinction between busy and calm nodes is that busy nodes all have o.a.l.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms() as second to fillBuffer(), what are they doing?! Why? The calm nodes don't show this at all. Busy nodes all have o.a.l.codec stuff on top, restarted nodes don't.
> > > > > >
> > > > > > So, actually, I don't have a clue! Any, any ideas?
> > > > > >
> > > > > > Thanks,
> > > > > > Markus
> > > > > >
> > > > > > Each replica is underpowered but performing really well after restart (and JVM warmup): 4 CPUs, 900M heap, 8 GB RAM, maxDoc 2.8 million, index size 18 GB.
> > > > >
> > > > > --
> > > > > Sincerely yours
> > > > > Mikhail Khludnev
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
>
> --
> Sincerely yours
> Mikhail Khludnev
>