Can you reproduce with 4G heap?

On Wed, 26 Jul 2017 at 23:11, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hello Mikhail,
>
> Spot on, there is indeed not enough heap when our nodes are in this crazy state. When the nodes are happy, the average heap consumption is 50 to 60 percent; at peak, when indexing, there is in general enough heap for the warming to run smoothly.
>
> I probably forgot to mention that high CPU in our case also means high heap consumption when the nodes act mad. The stack trace you saw was from a crazy node while documents were being indexed, so the blocking you mention makes sense.
>
> I still believe this is not a heap issue but something odd in Solr that, in (INSERT SITUATION), eats unreasonable amounts of heap. Remember, the bad node went back to a normal 50-60 percent heap consumption and normal CPU time after _other_ nodes got restarted. All nodes were bad and went back to normal after a restart. Bad nodes that were not restarted then suddenly revived and also went back to normal. This is very odd behavior.
>
> Observing this, I am inclined to think that Solr's inter-node communication can get into a weird state: eating heap, eating CPU, going bad. Using CPU or heap sampling it is very hard, for me at least, to quickly spot something that is bad, so I am clueless. What do you think?
>
> How can a bad node become normal just by restarting another bad node? Puzzling...
>
> Thanks,
> Markus
>
> -----Original message-----
> > From: Mikhail Khludnev <m...@apache.org>
> > Sent: Wednesday 26th July 2017 10:50
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> >
> > I've looked into the stack trace. I see that one thread triggers a commit via an update command, and it is blocked on searcher warming. The really odd thing is total state = BLOCKED. Can you check that there is spare heap space available during the indexing peak? And also that there is free RAM (after heap allocation)? Can it happen that a warming query is unnecessarily heavy? Also, explicit commits might cause issues; consider the best practice of auto-commit with openSearcher=false and a soft commit when necessary [see the solrconfig.xml sketch after this thread].
> >
> > On Mon, Jul 24, 2017 at 4:35 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > > Alright, after adding a field and a full cluster restart, the cluster is going nuts once again, and this time almost immediately after the restart.
> > >
> > > I have now restarted all but one node, so there is some room to compare, or so I thought. Now the node I didn't restart also drops in CPU usage. This seems to correspond to another incident some time ago where all nodes went crazy over an extended period, but calmed down after a few were restarted. So it could be a problem of inter-node communication.
> > >
> > > The index is in one segment at this moment, but some documents are being indexed. Some queries are executed, but not very many. Attaching the stack anyway.
> > >
> > > -----Original message-----
> > > > From: Mikhail Khludnev <m...@apache.org>
> > > > Sent: Wednesday 19th July 2017 14:41
> > > > To: solr-user <solr-user@lucene.apache.org>
> > > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > > >
> > > > You can get a stack dump from kill -3 or jstack, or even from the Solr admin UI [see the shell sketch after this thread]. Overall, this behavior looks like a typical heavy merge kicking off from time to time.
> > > > On Wed, Jul 19, 2017 at 3:31 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > No, I cannot expose the stack; VisualVM samples won't show it to me.
> > > > >
> > > > > I am not sure if they're about to sync all the time, but every 15 minutes some documents are indexed (3 - 4k). For some reason, index time does increase with latency / CPU usage.
> > > > >
> > > > > This situation runs fine for many hours, then it will slowly start to go bad, until nodes are restarted (or the index size is decreased).
> > > > >
> > > > > Thanks,
> > > > > Markus
> > > > >
> > > > > -----Original message-----
> > > > > > From: Mikhail Khludnev <m...@apache.org>
> > > > > > Sent: Wednesday 19th July 2017 14:18
> > > > > > To: solr-user <solr-user@lucene.apache.org>
> > > > > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > > > > >
> > > > > > > The real distinction between busy and calm nodes is that busy nodes all
> > > > > > > have o.a.l.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms() as
> > > > > > > second to fillBuffer(), what are they doing?
> > > > > >
> > > > > > Can you expose the stack deeper?
> > > > > > Can they start to sync shards due to some reason?
> > > > > >
> > > > > > On Wed, Jul 19, 2017 at 12:35 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Another peculiarity here: our six-node (2 shards / 3 replicas) cluster is going crazy after a good part of the day has passed. It starts eating CPU for no good reason and its latency goes up. The Grafana graphs show the problem really well.
> > > > > > >
> > > > > > > After restarting 2/6 nodes, there is also quite a distinction in the VisualVM monitor views and the VisualVM CPU sampler reports (sorted on self time (CPU)). The busy nodes are deeply red in o.a.h.impl.io.AbstractSessionInputBuffer.fillBuffer (as usual); the restarted nodes are not.
> > > > > > >
> > > > > > > The real distinction between busy and calm nodes is that busy nodes all have o.a.l.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms() as second to fillBuffer(). What are they doing?! Why? The calm nodes don't show this at all. Busy nodes all have o.a.l.codec stuff on top; restarted nodes don't.
> > > > > > >
> > > > > > > So, actually, I don't have a clue! Any, any ideas?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Markus
> > > > > > >
> > > > > > > Each replica is underpowered but performing really well after a restart (and JVM warmup): 4 CPUs, 900M heap, 8 GB RAM, maxDoc 2.8 million, index size 18 GB.
> > > > > >
> > > > > > --
> > > > > > Sincerely yours
> > > > > > Mikhail Khludnev
> > > >
> > > > --
> > > > Sincerely yours
> > > > Mikhail Khludnev
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
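For reference, Mikhail's suggestion in the thread (rely on auto-commit with openSearcher=false plus soft commits instead of explicit commits) roughly corresponds to an updateHandler block like the sketch below. This is a minimal sketch only; the interval values are illustrative assumptions, not taken from this thread, and should be tuned to your own indexing cadence.

  <!-- solrconfig.xml sketch: hard commits flush segments but do not open a new
       searcher (so no warming on every hard commit); the soft commit controls
       when newly indexed documents become visible. Intervals are illustrative. -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>60000</maxTime>              <!-- hard commit at most once a minute -->
      <openSearcher>false</openSearcher>    <!-- no searcher warming on hard commits -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>900000</maxTime>             <!-- soft commit every 15 min, roughly the indexing cadence mentioned in the thread -->
    </autoSoftCommit>
  </updateHandler>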
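Likewise, the thread-dump and heap suggestions (kill -3 / jstack, and the question about a 4G heap) can be tried with something like the commands below. This is a sketch under assumptions: it assumes a stock bin/solr install whose JVM is launched via Jetty's start.jar and configured through solr.in.sh; adjust to how your nodes are actually started. The admin UI's Thread Dump screen is an alternative that needs no shell access.

  # Find the Solr JVM's pid (assumes the node was started via Jetty's start.jar)
  SOLR_PID=$(pgrep -f start.jar | head -n 1)

  # Take a thread dump with jstack, written to a timestamped file ...
  jstack -l "$SOLR_PID" > /tmp/solr-threads-$(date +%s).txt

  # ... or with kill -3 (SIGQUIT), which prints the dump to the JVM's console log
  kill -3 "$SOLR_PID"

  # To retry with a larger heap (the 4G question above), e.g. in solr.in.sh:
  #   SOLR_HEAP="4g"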