Thanks for this information, I wasn't familiar with the notion of the Overseer yet.

When I say "high QPS", it's queries only. No documents were being indexed at the 
time of this issue.

I have two thread dumps taken at a time when we were having the issue:
- one on the server having the issue (CPU~90%): https://pastebin.com/NeeSXj9B
- one on another server not having issues (CPU ~20%): 
https://pastebin.com/vgExMf4s
Neither of these two servers is the Overseer, nor the collection leader.

Apart from the fact that the server having issues has a lot more threads in the 
RUNNABLE state while the other has most of its threads in TIMED_WAITING, I don't 
see anything obvious, but I'm not experienced at reading these dumps.
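
In case it's useful for a JIRA, here is a sketch of how the hot threads could be 
matched against the dump entries (the PID 12345 and thread id 12399 below are 
placeholders, not our real values):

  # take a thread dump of the Solr JVM
  jstack -l 12345 > solr-threaddump.txt

  # list per-thread CPU usage for the same JVM
  top -H -p 12345

  # convert the TID of a hot thread to hex to find it in the dump as "nid=0x306f"
  printf "nid=0x%x\n" 12399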

I had a look with a profiler on the server having issues.
During the high CPU period, if I sort the methods by "self time (CPU)" spent, I get:
- org.apache.solr.search.BitDocSet.andNot - 12%
- sun.nio.ch.ServerSocketChannelImpl.accept - 8.5%
- org.apache.solr.uninverting.FieldCacheImpl$LongsFromArray$1.longValue - 5.9%
- org.apache.lucene.util.FixedBitSet.clone - 5.9%
- java.util.HashSet.add - 5.4%
- org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.<init> - 4.1%
- org.apache.lucene.search.DisjunctionDISIApproximation.advance - 3.6%
- org.apache.lucene.codecs.lucene50.ForUtil.readBlock - 2.6%
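
If a cleaner capture is needed for a JIRA attachment, I can redo the profile with 
something like the following (a sketch assuming async-profiler; the PID and output 
path are placeholders):

  # sample CPU for 60 seconds and write a flame graph
  ./profiler.sh -e cpu -d 60 -f /tmp/solr-cpu-profile.svg 12345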

I'll open a JIRA if you think it could be useful, but I don't want to pollute 
JIRA with issues that are not fully qualified yet.

Kind Regards,
Gaël




________________________________
From: Erick Erickson <erickerick...@gmail.com>
Sent: Monday, March 4, 2019 18:53
To: Gael Jourdan-Weil
Subject: Re: SolrCloud one server with high load

Yes, the Overseer is different from the leader.

There is one leader _per shard_, and its job is to coordinate updates to 
the index for that shard.

The Overseer coordinates updates to ZooKeeper, and there is one (and only 
one) per cluster. A cluster can contain many collections, and each 
collection can have many shards, so there may be an unbounded number of 
leaders….
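
Besides the admin UI, a quick way to double-check which node currently holds the 
Overseer role is the Collections API OVERSEERSTATUS action; the "leader" entry in 
the response gives the Overseer node (host and port below are just an example):

  curl "http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json"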

Right, Solr 7.6 is one of the versions that has had some anecdotal reports like 
this. If at all possible, could you take a thread dump of a node when it’s 
having this problem? Or, even better, put a profiler on it? It’d be invaluable 
to see where the time was being spent. If you can do either of those things, 
please open a JIRA and attach the output.

If you do raise a JIRA, please include the information that the Overseer isn’t 
the one having the problem.

One other bit of info that’d be useful: when you say “high QPS”, is 
it all queries, or are you adding documents too?

Best,
Erick

> On Mar 4, 2019, at 9:45 AM, Gael Jourdan-Weil 
> <gael.jourdan-w...@kelkoogroup.com> wrote:
>
> Hi Erick,
>
> We are running Solr 7.6.0.
> We recently upgraded from 7.2.1, but we already had these issues with Solr 
> 7.2.1.
>
> Is the overseer different from the leader?
> In the Solr Admin UI > SolrCloud > Tree > overseer > leader file, I can see 
> that the machine listed as leader is not the one having issues right now.
>
> Kind Regards,
> Gaël
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Monday, March 4, 2019 17:57
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud one server with high load
>
> What version of Solr? There are some anecdotal reports of abnormal CPU loads 
> on very recent Solr versions.
>
> Is the server with the high load the “Overseer”? In the admin 
> UI >> SolrCloud >> Tree you can see which node is the Overseer. This is really a 
> shot in the dark, since unless you are doing a lot of collection maintenance 
> operations, the Overseer shouldn’t be doing much.
>
> There is _one_ Overseer per cluster and it’s in charge of coordinating 
> changes to ZooKeeper.
>
> If there’s a correlation there, it’d be great to know. It’s possible to move 
> the Overseer to a different node, one that’s running Solr but not necessarily 
> hosting any replicas. This isn’t a permanent solution, but would help isolate 
> the issue.
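> 
> For reference, the move can be done with the Collections API ADDROLE action, 
> something along these lines (the node name below is just an example; use the 
> node_name as it appears under live_nodes):
> 
>   curl "http://localhost:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=some-host:8983_solr"
> 
> As far as I understand, nodes given the overseer role are preferred at the next 
> Overseer election.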
>
> First, let’s see if the hot node is always the Overseer...
>
> Best,
> Erick
>
> > On Mar 4, 2019, at 6:51 AM, Gael Jourdan-Weil 
> > <gael.jourdan-w...@kelkoogroup.com> wrote:
> >
> > Hello Furkan,
> >
> > Yes, the 3 servers have the exact same configuration.
> >
> > Varnish load balancing is indeed round-robin.
> > We monitor the number of requests per second, and we can indeed see that the 3 
> > servers receive the same number of requests.
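> >
> > For example, per-core query request counts can be compared across servers with 
> > the Metrics API, something along these lines (host and handler below depend on 
> > the deployment):
> >
> >   curl "http://server1:8983/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes&wt=json"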
> >
> > Kind Regards,
> > Gaël
> >
> > ________________________________
> > From: Furkan KAMACI <furkankam...@gmail.com>
> > Sent: Monday, March 4, 2019 15:00
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud one server with high load
> >
> > Hi Gaël,
> >
> > Do all three servers have the same specifications? Also, is your
> > load balancing configuration for Varnish round-robin?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Mon, Mar 4, 2019 at 3:18 PM Gael Jourdan-Weil <
> > gael.jourdan-w...@kelkoogroup.com> wrote:
> >
> >> Hello,
> >>
> >> I come again to the community for some ideas regarding a performance issue
> >> we are having.
> >>
> >> We have a SolrCloud cluster of 3 servers.
> >> Each server hosts 1 replica of 2 collections.
> >> There is no sharding, every server hosts the whole collection.
> >>
> >> Requests are evenly distributed by a Varnish system.
> >>
> >> During some peaks of requests, we see one server of the cluster having
> >> very high load while the two others are totally fine.
> >> The server experiencing this high load is always the same until we reboot
> >> it, at which point the behavior moves to another server.
> >> The server experiencing the issue is not necessarily the leader.
> >> All servers receive the same number of requests per second.
> >>
> >> Load data:
> >> - Server1: 5% CPU when low QPS, 90% CPU when high QPS (this one having
> >> issues)
> >> - Server2: 5% CPU when low QPS, 25% CPU when high QPS
> >> - Server3: 5% CPU when low QPS, 20% CPU when high QPS
> >>
> >> What could explain this behavior in SolrCloud mechanisms?
> >>
> >> Thank you for reading,
> >>
> >> Gaël Jourdan-Weil
> >>
