On 3/25/2018 7:15 AM, Deepak Goel wrote:
$ Why is the 'qps' not increasing with an increase in threads? (If I
understand the qps parameter correctly?)

Likely because I sent all these queries to a single copy of the index.  We only have two copies of the index in production, plus a third copy on a dev server running a newer version of Solr. I sent the queries from the test program to the production server pair that's designated "standby" -- not receiving queries unless the other pair is down.

Our Solr servers do not handle a high query load.  It's usually less than two queries per second.

Handling a very high query load requires load balancing to multiple copies of the index (replicas in SolrCloud terminology). We don't need that, so we don't have a bunch of copies.  The only reason we have two copies is so we can handle hardware failure gracefully.  I bypassed the load balancer for these tests.
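
To illustrate what bypassing the balancer means here, a rough sketch in Python (standard library only; the host names and core name below are made up, not our real ones):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Going through the balancer would spread queries across both copies.
# Pointing at one backend directly, as these tests did, measures what
# a single copy of the index can do on its own.
BALANCED = "http://lb.example.com:8983/solr/mycore/select"   # hypothetical balanced endpoint
DIRECT = "http://idxa1.example.com:8983/solr/mycore/select"  # one backend, no balancing

def num_found(base_url, q):
    params = urlencode({"q": q, "rows": 10, "wt": "json"})
    with urlopen(base_url + "?" + params) as resp:
        return json.load(resp)["response"]["numFound"]

print(num_found(DIRECT, "banjo"))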

$ Is it possible to run with 10 & 5 & 2 threads?

Sure.

I have updated the gist with those results.

https://gist.github.com/elyograg/abedf4ae28467059e46781f7d474f379

$ What was the server utilisation (CPU, memory) when you ran the test?

I actually never looked while I was running the earlier tests, so I ran additional tests to gather that data.  The updated gist has server-side vmstat output, captured during a 20-thread test and again during a 200-thread test.  The server named idxa1 shows a higher CPU load because it aggregates the shard data and builds the query responses, in addition to serving three of the seven shards.  The server named idxa2 has four shards.  The extra shard on idxa2 is very small - a little over 321000 docs and a little over 500MB of disk - and it is where new docs are written.
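
In case it's useful: capturing that data was nothing fancy.  A sketch of the equivalent (not the exact invocation I used):

import subprocess

# Sample system-wide stats once per second for 30 samples while the
# benchmark runs; vmstat's us/sy/id columns under 'cpu' give the CPU
# picture, and the free/cache columns cover memory.
subprocess.run(["vmstat", "1", "30"])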

The CPU load on idxa2 is similar at both thread levels.  I think this is because all queries are served from cache.  But idxa1 shows a higher load: even when the cache is used, that server must still aggregate the shard data (which was pulled from cache) and create the responses.  The aggregation step is not cached, because Solr has no way to know that what it receives from the shards is cached data.
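
For anyone who hasn't seen a distributed query up close, this is roughly what the fan-out looks like from the client side.  The 'shards' parameter is real Solr syntax; the host and core names are invented for the example:

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# The node that receives this request sends the query to every shard
# listed, merges the per-shard top-N hits by score, and builds the
# final response.  Each shard's hits may come from its cache, but the
# merge itself happens fresh on every request.
shards = ",".join([
    "idxa1.example.com:8983/solr/shard1",  # hypothetical shard layout
    "idxa1.example.com:8983/solr/shard2",
    "idxa2.example.com:8983/solr/shard3",
])
params = urlencode({"q": "banjo", "rows": 10, "wt": "json", "shards": shards})
with urlopen("http://idxa1.example.com:8983/solr/shard1/select?" + params) as resp:
    print(json.load(resp)["response"]["numFound"])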

Here's the benchmark output from the 200-thread test during which I gathered the CPU information:

query count: 200000
elapsed count: 200000
query median: 488.0
elapsed median: 500.0
query 75th: 674.0
elapsed 75th: 686.0
query 95th: 1006.0
elapsed 95th: 1018.0
query 99th: 1283.01
elapsed 99th: 1299.0
total time in seconds: 542
numThreads: 200
queries per thread: 1000
qps: 369
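
For what it's worth, those numbers are just arithmetic over the raw per-query timings.  This is not the actual test program, only a sketch of the calculations it implies (the percentile method here is nearest-rank, which may differ slightly from what the real program does):

import statistics

def report(latencies_ms, total_seconds):
    # Summarize raw per-query latencies the way the output above does.
    ls = sorted(latencies_ms)
    pick = lambda p: ls[int(p / 100.0 * (len(ls) - 1))]  # nearest-rank percentile
    print("query count:", len(ls))
    print("query median:", statistics.median(ls))
    print("query 75th:", pick(75))
    print("query 95th:", pick(95))
    print("query 99th:", pick(99))
    print("total time in seconds:", total_seconds)
    print("qps: %d" % (len(ls) / total_seconds))  # 200000 / 542 ~= 369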

$ The 'query median' increases from 35 to 470 as you increase threads from
20 to 200 (You had mentioned earlier that QTime for the Banjo query was 11
when you had hit it the second time around)

When I got 11 ms, that was a *single* query.  This program runs a lot of them, so I'm not surprised by the increase.  I did the one-off queries on the dev server, not on the standby production servers that received the load test.  The hardware specs are similar, except that in dev the entire index lives on one server running Solr 6.6.2, and that server also hosts other indexes that are not handled by the production pair I used for the load test.
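
As a sanity check, the 200-thread numbers hang together under Little's law (in-flight requests = throughput x mean latency):

# 200 threads at 369 qps implies a mean round trip of about
# 200 / 369 = 0.542 s, which lines up with the elapsed percentiles
# above and with 542 total seconds for 1000 queries per thread.
threads, qps = 200, 369
print("implied mean latency: %.3f seconds" % (threads / qps))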

$ Can you please give Linux server configuration if possible?

What *exactly* are you looking for here?  I've got some information below, but I do not know if it's what you are after.

High level, first server (idxa1):
Dell PowerEdge 2950 III
Two 4-core CPUs.
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
64GB memory
Solr is version 4.7.2, with an 8GB heap
About 140GB of index data
CentOS 6, kernel 2.6.32-431.11.2.el6.centos.plus.x86_64
Oracle Java:
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)

Differences on the second server (idxa2):
model name      : Intel(R) Xeon(R) CPU           E5420  @ 2.50GHz
Slightly more index data (about 500MB more).
Kernel 2.6.32-504.12.2.el6.centos.plus.x86_64

The whole production index is in the ballpark of 280GB and contains over 187 million docs.  The dev server has more than 188 million docs.  I think the counts differ because we very recently deleted a bunch of data from the database but skipped the corresponding update of the Solr index.  The production indexes have been rebuilt since the delete, but the dev index hasn't.

The network between the client running the test and the Solr servers includes a layer 3 switch, some layer 2 switches, and a firewall.  All network hardware is made by Cisco.  The entire path (including the firewall) is gigabit.

Thanks,
Shawn
