Thanks, Shawn, for explaining everything in such detail; it was really helpful.
I have a few more queries on the same topic. Can you please explain the purpose of the third box in the minimal configuration, the one running the standalone ZooKeeper?

On a separate note, if we go ahead with 4 boxes (8 shards with a replication factor of 2 for each):

1. Would it be OK to keep the replica on the same box, or would we need a separate box for that?
2. Is the above configuration sufficient to guarantee failover and high availability?
3. How can I configure my application to always query the replica and let the master be used only for ingestion? The replica will be synced with the master after working hours (overnight).

Regards,
Pankaj

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday, May 15, 2013 12:01 PM
To: solr-user@lucene.apache.org
Subject: Billion document index

On 5/14/2013 11:00 PM, pankaj.pand...@wipro.com wrote:
> We have to setup a billion document index using Apache Solr(about 2 billion
> docs). I need some assistance on choosing the right configuration for the
> environment setup. I have been going through the Solr documentation, but
> couldn't figure out what would be the best configuration for same.
>
> Below is the configuration on one of the box we have, can you please assist
> if this will suffice our requirement.
> OS: SunOS
> RAM : 32GB
> Processor: 4 (SPARC64-VII+/2660Mhz)
> Server: Sun SPARC Enterprise M5000
>
> Desired System Requirements:
> Expected number of requests/day: 10,000
> New Documents ingested/day: 1Million
>
> I have below questions on same:
>
> 1. Does the above system configuration seems ok for our requirement?
>
> 2. Is it ok if we host entire index on a single physical box or I
> should use multiple physical box?
>
> 3. Should I go for the simple installation or SolrCloud?
>
> 4. If I should use SolrCloud, then probably I may have to use some
> master/slave setup(not sure though)? What I have in mind is to use master for
> ingestion of new documents and slave for querying. Then at the end of the day
> I can have the replication to update the slaves? Can you please advise if
> this is a good approach and most importantly if this is feasible?
>
> On google, I could find that many people have already setup such environment,
> but I couldn't figure out the configuration they are using. If some can share
> their experience, then it will probably help others as well.

There's a lot of information here for you to digest. Be sure to read all the way to the end. The numbers (and the costs associated with those numbers) might scare you.

I have no idea what's going to be in your index. I even looked up the website for your email domain, and still don't know what you might be trying to search.

Because it uses a 32-bit signed number to track things, a single Solr index (not sharded) is limited to a little more than 2 billion documents. This means you'll want to use distributed search (sharding) from the beginning. For new deployments, SolrCloud is much better than trying to handle sharding yourself.

You'll probably want 8 or more shards and a minimum replication factor of 2, so that you have two copies of every shard. That doesn't necessarily mean that you'll need that many machines, but you might want to plan on at least four of them. You'll probably be putting more than one shard per server.

SolrCloud is a true cluster - there is no master and no slaves. Both indexing and queries are completely distributed. Clients that you write in Java get these distributed features with no extra requirements; non-Java clients will require some form of external load balancing to ensure that they are always talking to a server that's up.

The absolute minimum number of physical machines you need for SolrCloud is three. Two of those need to be the beefy workhorses. Each of them will run Solr and a standalone ZooKeeper. The third machine can be modest and will just run a third instance of ZooKeeper.
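
A minimal sketch of the reasoning behind the third ZooKeeper instance, assuming ZooKeeper's usual rule that an ensemble stays available only while a strict majority of its nodes is up. The function name is hypothetical and chosen only for illustration:

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Return how many ZooKeeper nodes can fail while a strict majority survives."""
    # A quorum is a strict majority of the configured ensemble.
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 5):
        print(f"{size} node(s): survives {tolerated_failures(size)} failure(s)")
```

Under that assumption, two ZooKeeper nodes tolerate no failures at all, while three tolerate one, which is why a two-machine Solr setup still needs a third box for ZooKeeper.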
If you have more than two servers that will run Solr, then you can just run the standalone ZooKeeper on three of them and won't need any extra hardware.

Memory is going to be your real problem with a very large index. When it comes to the amount of required memory, you might want to read this wiki page, then come back here:

http://wiki.apache.org/solr/SolrPerformanceProblems

I really was serious about reading that page, and not just because I wrote it. The information you'll find there is key to understanding the scale of what you propose and what I'm going to say below.

Even with very small documents, an index with 2 billion of them is probably going to be at least 100GB, and quite possibly 300GB, 500GB, or larger.

For discussion purposes, let's say that you've got the extremely conservative index size of 100GB and you're going to put that on four servers. To cache this index sufficiently to avoid performance problems, you'll need between 64GB and 128GB of total RAM for caching across the entire cluster. If we assume that you've taken every possible step to reduce Solr's Java heap requirements, you might be able to do a heap of 8 to 16GB per server, but the actual heap requirement could be significantly higher. Adding this up, you get a bare minimum memory requirement of 32GB for each of those four servers. Ideally, you'd need to have 48GB for each of them. If you plan to put it on two Solr servers instead of four, double the per-server memory requirement.

Remember that all the information in the previous paragraph assumes a total index size of 100GB, and your index has the potential to be a lot bigger than 100GB. If you have a 300GB index size instead of 100GB, triple those numbers. Scale up similarly for larger sizes.
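
The arithmetic above can be sketched roughly as follows. This is only an illustration of the scaling rules stated in the text (a 100GB index on four servers as the baseline, then linear scaling by index size and by server count); the function and constant names are hypothetical, and real heap and cache needs depend heavily on the data and queries:

```python
# Baseline figures taken from the text: 100GB index spread over 4 servers.
BASE_INDEX_GB = 100
BASE_SERVERS = 4
CACHE_TOTAL_GB = (64, 128)    # OS disk cache across the whole cluster (low, high)
HEAP_PER_SERVER_GB = (8, 16)  # Solr Java heap per server (low, high)


def per_server_ram_gb(index_gb: float, servers: int) -> tuple:
    """Estimate a (low, high) RAM range per server by scaling the baseline."""
    base = tuple(
        cache / BASE_SERVERS + heap
        for cache, heap in zip(CACHE_TOTAL_GB, HEAP_PER_SERVER_GB)
    )
    # Triple the index -> triple the numbers; half the servers -> double them.
    scale = (index_gb / BASE_INDEX_GB) * (BASE_SERVERS / servers)
    return tuple(value * scale for value in base)


if __name__ == "__main__":
    for index_gb, servers in ((100, 4), (100, 2), (300, 4)):
        low, high = per_server_ram_gb(index_gb, servers)
        print(f"{index_gb}GB index on {servers} servers: {low:.0f}-{high:.0f}GB each")
```

The upper end of the baseline range matches the 48GB figure. The lower end pairs the smallest cache with the smallest heap, so it comes out below 32GB; treat 32GB as the practical floor rather than reading the low number as a target.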
One final note: your anticipated query volume is quite low, so you might be able to get away with a little bit less memory than I have described here, but you should be aware that running with less may cause query times measured in tens of seconds, and SolrCloud may become very unstable.

Thanks,
Shawn