Howdy all - The short version is: We are not seeing Solr Cloud performance scale (event close to) linearly as we add nodes. Can anyone suggest good diagnostics for finding scaling bottlenecks? Are there known 'gotchas' that make Solr Cloud fail to scale?
In detail: We have used Solr (in non-Cloud mode) for over a year and are now beginning a transition to SolrCloud. To this end I have been running some basic load tests to figure out what kind of capacity we should expect to provision. In short, I am seeing very poor scalability (increase in effective QPS) as I add Solr nodes. I'm hoping to get some ideas on where I should be looking to debug this. Apologies in advance for the length of this email; I'm trying to be comprehensive and provide all relevant information. Our setup: 1 load generating client - generates tiny, fake documents with unique IDs - performs only writes (no queries at all) - chooses a random solr server for each ADD request (with 1 doc per add request) N collections spread over K solr servers - every collection is sharded K times (so every solr instance has 1 shard from every collection) - no replicas - external zookeeper server (not using zkRun) - autoCommit maxTime=60000 - autoSoftCommit maxTime =15000 Everything is running within a single zone on Google Compute Engine, so high quality gigabit network links between all machines (ping times < 1ms). My methodology is as follows. 1. Start up a K solr servers. 2. Remove all existing collections. 3. Create N collections, with numShards=K for each. 4. Start load testing. Every minute, print the number of successful updates and the number of failed updates. 5. Keep increasing the offered load (via simulated users) until the qps flatlines. In brief (more detailed results at the bottom of email), I find that for any number of nodes between 2 and 5, the QPS always caps out at ~3000. Obviously something must be wrong here, as there should be a trend of the QPS scaling (roughly) linearly with the number of nodes. Or at the very least going up at all! So my question is what else should I be looking at here? * CPU on the loadtest client is well under 100% * No other obvious bottlenecks on loadtest client (running 2 clients leads to ~1/2 qps on each) * In many cases, CPU on the solr servers is quite low as well (e.g. with 100 users hitting 5 solr nodes, all nodes are >50% idle) * Network bandwidth is a few MB/s, well under the gigabit capacity of our network * Disk bandwidth (< 2 MB/s) and iops (< 20/s) are low. Any ideas? Thanks very much! - Ian p.s. Here is my raw data broken out by number of nodes and number of simulated users: Num NodesNum UsersQPS111020153180110382511539001204050140410021472251790210 229021528502202900240321026032002803210210031803138535158031020903152560320 27603252890380305041375451560410220041525004202700425280043028505152450520 2640525279053028405100290052002810