On 1/19/2016 1:30 PM, Troy Edwards wrote:
We are currently "beta testing" a SolrCloud with 2 nodes and 2 shards with
2 replicas each. The number of documents is about 125000.

We now want to scale this to about 10 billion documents.

What are the steps to prototyping, hardware estimation and stress testing?

There is no general information available for sizing, because there are too many factors that will affect the answers. Some of the important information that you need will be impossible to predict until you actually build it and subject it to a real query load.

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

With an index of 10 billion documents, you may not be able to precisely predict performance and hardware requirements from a small-scale prototype. You'll likely need to build a full-scale system on a small testbed, look for bottlenecks, ask for advice, and plan on a larger system for production.

The hard limit for documents on a single shard is slightly less than Java's Integer.MAX_VALUE -- just over two billion. Because deleted documents count against this max, about one billion documents per shard is the absolute max that should be loaded in practice.

BUT, if you actually try to put one billion documents in a single server, performance will likely be awful. A more reasonable limit per machine is 100 million ... but even this is quite large. You might need smaller shards, or you might be able to get good performance with larger shards. It all depends on things that you may not even know yet.

Memory is always a strong driver for Solr performance, and I am speaking specifically of OS disk cache -- memory that has not been allocated by any program. With 10 billion documents, your total index size will likely be hundreds of gigabytes, and might even reach terabyte scale. Good performance with indexes this large will require a lot of total memory, which probably means that you will need a lot of servers and many shards. SSD storage is strongly recommended.

For extreme scaling on Solr, especially if the query rate will be high, it is recommended to only have one shard replica per server.

I have just added an "extreme scaling" section to the following wiki page, but it's mostly a placeholder right now. I would like to have a discussion with people who operate very large indexes so I can put real usable information in this section. I'm on IRC quite frequently in the #solr channel.

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn

Reply via email to