On 1/19/2016 1:30 PM, Troy Edwards wrote:
We are currently "beta testing" a SolrCloud with 2 nodes and 2 shards with
2 replicas each. The number of documents is about 125000.
We now want to scale this to about 10 billion documents.
What are the steps to prototyping, hardware estimation and stress testing?
There is no general information available for sizing, because there are
too many factors that will affect the answers. Some of the important
information that you need will be impossible to predict until you
actually build it and subject it to a real query load.
https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
With an index of 10 billion documents, you may not be able to precisely
predict performance and hardware requirements from a small-scale
prototype. You'll likely need to build a full-scale system on a small
testbed, look for bottlenecks, ask for advice, and plan on a larger
system for production.
The hard limit for documents on a single shard is slightly less than
Java's Integer.MAX_VALUE -- just over two billion. Because deleted
documents count against this max, about one billion documents per shard
is the absolute max that should be loaded in practice.
BUT, if you actually try to put one billion documents in a single
server, performance will likely be awful. A more reasonable limit per
machine is 100 million ... but even this is quite large. You might need
smaller shards, or you might be able to get good performance with larger
shards. It all depends on things that you may not even know yet.
Memory is always a strong driver for Solr performance, and I am speaking
specifically of OS disk cache -- memory that has not been allocated by
any program. With 10 billion documents, your total index size will
likely be hundreds of gigabytes, and might even reach terabyte scale.
Good performance with indexes this large will require a lot of total
memory, which probably means that you will need a lot of servers and
many shards. SSD storage is strongly recommended.
For extreme scaling on Solr, especially if the query rate will be high,
it is recommended to only have one shard replica per server.
I have just added an "extreme scaling" section to the following wiki
page, but it's mostly a placeholder right now. I would like to have a
discussion with people who operate very large indexes so I can put real
usable information in this section. I'm on IRC quite frequently in the
#solr channel.
https://wiki.apache.org/solr/SolrPerformanceProblems
Thanks,
Shawn