Hi,

We've got a 3-node SolrCloud cluster running on GKE, with each Solr pod on
its own Kubernetes node (which is itself relatively free of other workloads).

Our collection has ~18M documents (~36GB on disk), split into 6 shards with
2 replicas each, and the replicas are evenly distributed across the 3 nodes.
Our JVM heaps are currently sized to ~14GB min & max, and the data is on SSDs.
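
For reference, the layout corresponds to a Collections API CREATE call
roughly like the one below (the collection name and host are placeholders;
maxShardsPerNode just reflects 12 replicas spread over 3 nodes):

    import requests

    # Sketch of a Collections API CREATE matching the layout described above:
    # 6 shards x 2 replicas = 12 cores, i.e. 4 per Solr node.
    resp = requests.get(
        "http://solr-0.solr-headless:8983/solr/admin/collections",
        params={
            "action": "CREATE",
            "name": "our-collection",      # placeholder collection name
            "numShards": 6,
            "replicationFactor": 2,
            "maxShardsPerNode": 4,         # 12 replicas / 3 nodes
        },
    )
    print(resp.json())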


[image: Screen Shot 2020-10-23 at 2.15.48 pm.png]

Graph also available here: https://pasteboard.co/JwUQ98M.png

Under perf testing at ~30 requests per second, we start seeing really bad
response times (around 3s at the 90th percentile), and *one* of the nodes
ends up fully maxed out on CPU. At about 15 requests per second, our
response times are reasonable enough for our purposes (~0.8-1.1s), but as
is visible in the graph, the CPU load is definitely *not* evenly
distributed: one node is running at around 13 cores, whilst the other 2 are
at ~8 cores and ~6 cores respectively.

Our monitoring shows that the 3 nodes *are* receiving an even distribution
of requests. We're load balancing through a Kubernetes Service, which is a
well-established mechanism for balancing traffic across pods, and we've
used Kubernetes Services heavily for other apps without ever seeing this
kind of problem, so we doubt the load balancer is at fault.

All 3 nodes are built from the same Kubernetes StatefulSet, so they all
have identical configuration and setup. Additionally, over the course of
the day, the hot spot can suddenly shift so that an entirely different node
becomes the one that is heavily overloaded on CPU.

All of this happens under query load only; we are doing no indexing at the
time.

We initially thought the overseer might be getting heavily overloaded under
query load (which surprised us), but further testing showed that even nodes
that weren't the overseer would sometimes show the same disparity. We also
tried using the `ADDROLE` API to force an overseer change in the middle of
a test, and whilst the tree updated to show that the overseer had changed,
it made no difference to which node carried the highest CPU load.
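
(Concretely, the role change was done with something like the following;
the host and node name below are placeholders in the "<host>:<port>_solr"
format:)

    import requests

    # Ask the cluster to move the overseer role to a specific node mid-test.
    resp = requests.get(
        "http://solr-0.solr-headless:8983/solr/admin/collections",
        params={
            "action": "ADDROLE",
            "role": "overseer",
            "node": "solr-2.solr-headless:8983_solr",  # placeholder node name
        },
    )
    print(resp.json())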

Directing queries straight at the non-busy nodes does actually give us
decent response times.
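
(i.e. bypassing the Service and hitting a specific pod, roughly like this;
the pod and collection names are placeholders. The query is still a normal
distributed SolrCloud request, only the coordinating node changes:)

    import requests

    # Send the query to one of the quieter pods directly instead of via the
    # Kubernetes Service; Solr still fans it out to one replica per shard.
    resp = requests.get(
        "http://solr-1.solr-headless:8983/solr/our-collection/select",
        params={"q": "*:*", "rows": 10},
    )
    print(resp.json()["responseHeader"]["QTime"])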

We're quite puzzled by this and would really like some help figuring out
*why* the CPU on one node is so much higher. I did try to get Jaeger
tracing working (we already have Jaeger in our cluster), but we just kept
getting errors on startup about Solr not being able to load the main
function...


Thank you in advance!
Cheers
Jonathan
