Ideas for debugging poor SolrCloud scalability

Ian Rose Thu, 30 Oct 2014 13:25:02 -0700

Howdy all -

The short version is: We are not seeing Solr Cloud performance scale (event
close to) linearly as we add nodes. Can anyone suggest good diagnostics for
finding scaling bottlenecks? Are there known 'gotchas' that make Solr Cloud
fail to scale?


In detail:

We have used Solr (in non-Cloud mode) for over a year and are now beginning
a transition to SolrCloud.  To this end I have been running some basic load
tests to figure out what kind of capacity we should expect to provision.
In short, I am seeing very poor scalability (increase in effective QPS) as
I add Solr nodes.  I'm hoping to get some ideas on where I should be
looking to debug this.  Apologies in advance for the length of this email;
I'm trying to be comprehensive and provide all relevant information.

Our setup:

1 load generating client
 - generates tiny, fake documents with unique IDs
 - performs only writes (no queries at all)
 - chooses a random solr server for each ADD request (with 1 doc per add
request)

N collections spread over K solr servers
 - every collection is sharded K times (so every solr instance has 1 shard
from every collection)
 - no replicas
 - external zookeeper server (not using zkRun)
 - autoCommit maxTime=60000
 - autoSoftCommit maxTime =15000

Everything is running within a single zone on Google Compute Engine, so
high quality gigabit network links between all machines (ping times < 1ms).

My methodology is as follows.
1. Start up a K solr servers.
2. Remove all existing collections.
3. Create N collections, with numShards=K for each.
4. Start load testing.  Every minute, print the number of successful
updates and the number of failed updates.
5. Keep increasing the offered load (via simulated users) until the qps
flatlines.

In brief (more detailed results at the bottom of email), I find that for
any number of nodes between 2 and 5, the QPS always caps out at ~3000.
Obviously something must be wrong here, as there should be a trend of the
QPS scaling (roughly) linearly with the number of nodes.  Or at the very
least going up at all!

So my question is what else should I be looking at here?

* CPU on the loadtest client is well under 100%
* No other obvious bottlenecks on loadtest client (running 2 clients leads
to ~1/2 qps on each)
* In many cases, CPU on the solr servers is quite low as well (e.g. with
100 users hitting 5 solr nodes, all nodes are >50% idle)
* Network bandwidth is a few MB/s, well under the gigabit capacity of our
network
* Disk bandwidth (< 2 MB/s) and iops (< 20/s) are low.

Any ideas?  Thanks very much!
- Ian


p.s. Here is my raw data broken out by number of nodes and number of
simulated users:


Num NodesNum UsersQPS111020153180110382511539001204050140410021472251790210
229021528502202900240321026032002803210210031803138535158031020903152560320
27603252890380305041375451560410220041525004202700425280043028505152450520
2640525279053028405100290052002810

Ideas for debugging poor SolrCloud scalability

Reply via email to