On 3/10/2018 9:44 AM, Erick Erickson wrote:
> There are quite a number of reasons you may be seeing this, all having
> to do with trying to put too much stuff in too little hardware.
> <snip>
> At any rate, there's no a-priori limit to the number of
> collections/replicas/whatever that Solr can deal with, the limits are
> your hardware.
Big +1 to everything Erick said. I was working on a response, but I
think the things that Erick said are better than what I was writing. I
did want to add a little bit of detail, though.
Let me throw some numbers at you:
With 900 collections that have 25 shards, the SolrCloud cluster is
handling 22500 individual indexes -- and that's for only one replica.
If I assume that you've only got one copy of all that data in Solr, then
every one of those 49 servers will average over 450 index cores! If you
planned for redundancy and replicationFactor is 2, then each server will
have over 900 index cores on it (45000 total indexes).
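If you want to see that arithmetic spelled out, here's a quick
back-of-the-envelope calculation (plain Python, nothing Solr-specific,
using the numbers from your description):

  collections = 900
  shards_per_collection = 25
  servers = 49

  for replication_factor in (1, 2):
      total_cores = collections * shards_per_collection * replication_factor
      per_server = total_cores / servers
      print(f"replicationFactor={replication_factor}: "
            f"{total_cores} cores total, ~{per_server:.0f} per server")

  # replicationFactor=1: 22500 cores total, ~459 per server
  # replicationFactor=2: 45000 cores total, ~918 per server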
The per-server core counts might look like small numbers to you ... but
once you start to understand some of what Solr and Lucene are doing
under the covers, you realize that those numbers are quite large.
If there's even a hint of significant index/query traffic going to those
collections, then those servers are going to be hard-pressed to keep up,
unless you've spent a LOT more money for hardware than what I see in a
typical server.
Have you ever heard of a phenomenon known as a performance knee? The
following link should open up to page 482 of the associated book.
Scroll up to page 481 and read the two sections named "Hardware
Resource Requirements" and "Peak Load Behavior". Your reading would end
partway through page 482.
https://books.google.com/books?id=ka4QO9kXQFUC&pg=PA482&lpg=PA482&dq=performance+knee+scalability+graph&source=bl&ots=ytNn04PMU8&sig=r8GAGLIaT4mDwjCjYs_waQJbqo8&hl=en&sa=X&ved=0ahUKEwjH6bSdq-HZAhUJslMKHeQhALQQ6AEINDAB#v=onepage&q=performance%20knee%20scalability%20graph&f=false
Computer systems and software commonly show response times that grow
more or less linearly as load increases, until the system hits a
breaking point where even the smallest additional load causes an
enormous jump in response times. That can lead to an outage, where
things take so long that clients give up and go into an error state.
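If you want a toy illustration of that knee, here's a sketch using the
textbook M/M/1 queueing formula (mean response time = 1 / (service rate
- arrival rate)). The 100-requests-per-second capacity is an arbitrary
assumption for illustration, not a Solr measurement:

  service_rate = 100.0  # requests/sec the system can handle (assumed)

  for load in (50.0, 80.0, 90.0, 95.0, 99.0):
      # Mean response time for an M/M/1 queue, converted to milliseconds.
      response_ms = 1000.0 / (service_rate - load)
      print(f"{load:.0f} req/sec -> ~{response_ms:.0f} ms average response")

  # 50 req/sec -> ~20 ms, 90 -> ~100 ms, 99 -> ~1000 ms: response time
  # grows slowly at first, then explodes as load approaches capacity.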
FYI: You're running a "dot zero" release. 6.0 was the first release to
include the MAJOR development changes that had been accumulating and
were never part of any 5.x version. The 6.x timeframe was an absolute
hotbed of innovation on SolrCloud.
There are typically more bugs in a .0 release than in later minor
releases. If there are reasons for you to avoid running 7.x releases,
then you should upgrade to the latest 6.x. Right now that is 6.6.2,
but the 6.6.3 release is in progress and is likely to be announced any
day now.
SolrCloud does not handle large numbers of collections very well,
especially if the shard and/or replica counts are high for each
collection. Up to a point, throwing a LOT of hardware at the problem
helps, but there are scalability issues related to the way SolrCloud
uses the ZooKeeper database. Eventually it just can't handle it. See
this issue for some experiments that I conducted on the problem:
https://issues.apache.org/jira/browse/SOLR-7191
Thanks,
Shawn