On 3/10/2018 9:44 AM, Erick Erickson wrote:
There are quite a number of reasons you may be seeing this, all having
to do with trying to put too much stuff in too little hardware.
<snip>
At any rate, there's no a-priori limit to the number of
collections/replicas/whatever that Solr can deal with, the limits are
your hardware.

Big +1 to everything Erick said.  I was working on a response, but I think the things that Erick said are better than what I was writing.  I did want to add a little bit of detail, though.

Let me throw some numbers at you:

With 900 collections that have 25 shards, the SolrCloud cluster is handling 22500 individual indexes -- and that's for only one replica.  If I assume that you've only got one copy of all that data in Solr, then every one of those 49 servers will average over 450 index cores!  If you planned for redundancy and replicationFactor is 2, then each server will have over 900 index cores on it (45000 total indexes).
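In case it helps to see that arithmetic spelled out, here's a quick back-of-the-envelope sketch in Python.  The collection, shard, and server counts are the ones from your message; everything else is just multiplication:

# Back-of-the-envelope core counts for the cluster described above.
collections = 900
shards_per_collection = 25
servers = 49

for replication_factor in (1, 2):
    total_cores = collections * shards_per_collection * replication_factor
    per_server = total_cores / servers
    print(f"replicationFactor={replication_factor}: "
          f"{total_cores} total cores, ~{per_server:.0f} per server")

# Output:
# replicationFactor=1: 22500 total cores, ~459 per server
# replicationFactor=2: 45000 total cores, ~918 per server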

The per-server index counts might look small to you ... but once you understand some of what Solr and Lucene are doing under the covers, you start to realize that these numbers are quite large.

If there's even a hint of significant indexing/query traffic going to those collections, then those servers are going to be hard-pressed to keep up, unless you've spent a LOT more money on hardware than what I see in a typical server.

Have you ever heard of a phenomenon known as a performance knee?  The following link should open to page 482 of the associated book.  Scroll up to page 481 and read the two sections named "Hardware Resource Requirements" and "Peak Load Behavior".  Your reading would end partway through page 482.

https://books.google.com/books?id=ka4QO9kXQFUC&pg=PA482&lpg=PA482&dq=performance+knee+scalability+graph&source=bl&ots=ytNn04PMU8&sig=r8GAGLIaT4mDwjCjYs_waQJbqo8&hl=en&sa=X&ved=0ahUKEwjH6bSdq-HZAhUJslMKHeQhALQQ6AEINDAB#v=onepage&q=performance%20knee%20scalability%20graph&f=false

A common tendency in the performance of computer systems and software is that response time increases more or less linearly as load increases, until the system reaches a breaking point where even the smallest increase in load causes an enormous jump in response times.  That can lead to an outage situation, where things take so long that clients give up and go to an error state.
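For a concrete (if deliberately oversimplified) illustration of that knee, classic queueing theory produces the same curve.  This is NOT a model of your Solr cluster -- the service rate below is a made-up number -- it just shows how response time explodes as load approaches capacity:

# Mean response time in a simple M/M/1 queue: T = 1 / (mu - lambda).
# The service rate is hypothetical; the shape of the curve is the point.
service_rate = 100.0  # requests/sec the system can handle at most

for load in (10, 50, 80, 90, 95, 99):
    response_ms = 1000.0 / (service_rate - load)
    print(f"load={load:>2} req/s -> mean response ~{response_ms:.0f} ms")

# load=10 req/s -> mean response ~11 ms
# load=50 req/s -> mean response ~20 ms
# load=80 req/s -> mean response ~50 ms
# load=90 req/s -> mean response ~100 ms
# load=95 req/s -> mean response ~200 ms
# load=99 req/s -> mean response ~1000 ms

Notice that going from 10 to 90 requests per second multiplies response time by about nine, but the last few requests per second before saturation multiply it by ten all by themselves.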

FYI: You're running a "dot zero" release.  6.0 was the first release to include MAJOR development changes that had been accumulating and were never part of any 5.x version.  The 6.x timeframe was an absolute hotbed of SolrCloud innovation.

There are typically more bugs in a .0 release than in later minor releases.  If there are reasons for you to avoid running 7.x releases, then you should upgrade to the latest 6.x.  Right now that is 6.6.2, but the 6.6.3 release is in progress and is likely to be announced any day now.

SolrCloud does not handle large numbers of collections very well, especially if the shard and/or replica counts are high for each collection.  Up to a point, throwing a LOT of hardware at the problem helps, but there are scalability issues related to the way SolrCloud uses the ZooKeeper database.  Eventually it just can't handle it.  See this issue for some experiments that I conducted on the problem:

https://issues.apache.org/jira/browse/SOLR-7191
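If you want to see how your cores are actually distributed, the Collections API has a CLUSTERSTATUS action that reports every replica and which node it lives on.  Here's a rough Python sketch that tallies cores per node -- the host and port are placeholders for whatever node you can reach:

import json
from collections import Counter
from urllib.request import urlopen

# Placeholder URL -- point this at any node in your cluster.
URL = ("http://localhost:8983/solr/admin/collections"
       "?action=CLUSTERSTATUS&wt=json")

status = json.load(urlopen(URL))
cores_per_node = Counter()
for collection in status["cluster"]["collections"].values():
    for shard in collection["shards"].values():
        for replica in shard["replicas"].values():
            cores_per_node[replica["node_name"]] += 1

for node, count in cores_per_node.most_common():
    print(f"{node}: {count} cores")

Be warned: on a cluster with hundreds of collections, the CLUSTERSTATUS response itself can get quite large.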

Thanks,
Shawn
