RE: Billion document index

pankaj.pandey4 Wed, 15 May 2013 02:56:48 -0700

Thanks, Shawn, for explaining everything in such detail. It was really helpful.

I have a few more questions on the same topic. Can you please explain the 
purpose of the third box in the minimal configuration, the one that runs only 
the standalone ZooKeeper?

On a separate note, if we go ahead with 4 boxes (8 shards with a replication 
factor of 2 for each):
        1. Would it be OK to keep the replica on the same box, or would we 
need a separate box for it?
        2. Is the above configuration sufficient to guarantee failover and 
high availability?
        3. How can I configure my application to always query the replica and 
let the master be used only for ingestion? The replica would be synced with 
the master after working hours (overnight).

Regards,
Pankaj

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday, May 15, 2013 12:01 PM
To: solr-user@lucene.apache.org
Subject: Billion document index

On 5/14/2013 11:00 PM, pankaj.pand...@wipro.com wrote:
> We have to set up a billion-document index using Apache Solr (about 2 billion 
> docs). I need some assistance in choosing the right configuration for the 
> environment setup. I have been going through the Solr documentation, but 
> couldn't figure out what the best configuration would be.
>
> Below is the configuration of one of the boxes we have. Can you please advise 
> whether this will suffice for our requirements?
> OS: SunOS
> RAM : 32GB
> Processor: 4 (SPARC64-VII+/2660Mhz)
> Server: Sun SPARC Enterprise M5000
>
> Desired System Requirements:
> Expected number of requests/day: 10,000
> New documents ingested/day: 1 million
>
> I have the following questions:
>
> 1.       Does the above system configuration seem OK for our requirements?
>
> 2.       Is it OK to host the entire index on a single physical box, or 
> should I use multiple physical boxes?
>
> 3.       Should I go for the simple installation or SolrCloud?
>
> 4.       If I use SolrCloud, I will probably have to use some kind of 
> master/slave setup (not sure though). What I have in mind is to use the master 
> for ingestion of new documents and the slaves for querying. Then, at the end of 
> the day, I can have replication update the slaves. Can you please advise 
> whether this is a good approach and, most importantly, whether it is feasible?
>
> On Google, I found that many people have already set up such an environment, 
> but I couldn't figure out the configuration they are using. If someone can 
> share their experience, it will probably help others as well.

There's a lot of information here for you to digest.  Be sure to read all the 
way to the end.  The numbers (and the costs associated with those numbers) 
might scare you.

I have no idea what's going to be in your index.  I even looked up the website 
for your email domain, and still don't know what you might be trying to search.

Because it uses a 32-bit signed number to track things, a single Solr index 
(not sharded) is limited to a little more than 2 billion documents.  This means 
you'll want to use distributed search (sharding) from the beginning.  For new 
deployments, SolrCloud is much better than trying to handle sharding yourself.  
You'll probably want 8 or more shards and a minimum replication factor of 2, so 
that you have two copies of every shard.  That doesn't necessarily mean that 
you'll need that many machines, but you might want to plan on at least four of 
them.  You'll probably be putting more than one shard per server.
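
To put rough numbers on the document limit and the shard count, here is a 
minimal sketch.  It assumes the ceiling is 2^31 - 1, the largest 32-bit signed 
value (the exact Lucene limit may be slightly lower), and it uses the 2 billion 
documents, 1 million new documents per day, and 8 shards mentioned in this 
thread.  The function names are made up for illustration.

```python
# Assumed ceiling: the largest value a 32-bit signed integer can hold.
MAX_DOCS_PER_INDEX = 2**31 - 1


def docs_per_shard(total_docs, shards):
    """Documents each shard holds if they are spread evenly (rounded up)."""
    return -(-total_docs // shards)  # ceiling division


def days_until_limit(current_docs, new_docs_per_day):
    """Whole days of ingestion a single unsharded index can absorb."""
    return (MAX_DOCS_PER_INDEX - current_docs) // new_docs_per_day


if __name__ == "__main__":
    print("per shard:", docs_per_shard(2_000_000_000, 8))
    print("days left unsharded:", days_until_limit(2_000_000_000, 1_000_000))
```

Under those assumptions a single index starting at 2 billion documents has 
less than five months of headroom, while each of 8 shards would hold about 250 
million documents.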

SolrCloud is a true cluster - there is no master and no slaves.  Both indexing 
and queries are completely distributed.  Clients that you write in Java get 
these distributed features with no extra requirements.  Non-Java clients will 
require some form of external load balancing to ensure that they are always 
talking to a server that's up.
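
For a non-Java client, the simplest stand-in for a load balancer is to try 
each node in turn until one answers.  The sketch below is only an illustration 
of that idea, not a replacement for a real load balancer: the host names and 
the collection name are hypothetical, and it assumes each node answers on the 
usual /solr/<collection>/select handler.

```python
import json
import urllib.parse
import urllib.request


def http_get_json(url, timeout=5):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def query_any_node(nodes, collection, params, fetch=http_get_json):
    """Send the query to the first node that responds, skipping dead ones."""
    query = urllib.parse.urlencode(params)
    last_error = None
    for base in nodes:
        url = f"{base}/solr/{collection}/select?{query}"
        try:
            return fetch(url)
        except OSError as exc:  # URLError and socket timeouts are OSErrors
            last_error = exc
    raise ConnectionError(f"no Solr node reachable: {last_error}")


# Hypothetical usage; these hosts and this collection do not exist:
# nodes = ["http://solr1.example.com:8983", "http://solr2.example.com:8983"]
# result = query_any_node(nodes, "docs", {"q": "*:*", "wt": "json"})
```

Because any SolrCloud node can distribute a query across the shards, it does 
not matter which live node the request lands on.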

The absolute minimum number of physical machines you need for SolrCloud is 
three.  Two of those need to be the beefy workhorses.  Each of them will run 
Solr and a standalone ZooKeeper.  The third machine can be modest and will just 
run a third instance of ZooKeeper.  If you have more than two servers that will 
run Solr, then you can just run the standalone ZooKeeper on three of them and 
won't need any extra hardware.
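
The reason for three ZooKeeper instances rather than two is majority voting: a 
ZooKeeper ensemble keeps working only while more than half of its members are 
up.  A small sketch of that arithmetic (the function names are made up for 
illustration):

```python
def quorum_size(ensemble_size):
    """Smallest number of members that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many members can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, "members:", tolerated_failures(n), "failure(s) tolerated")
```

With two members, losing either one loses the majority, so two are no safer 
than one.  Three is the smallest ensemble that survives a single failure.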

Memory is going to be your real problem with a very large index.  When it comes 
to the amount of required memory, you might want to read this wiki page, then 
come back here:

http://wiki.apache.org/solr/SolrPerformanceProblems

I really was serious about reading that page, and not just because I wrote it.  
The information you'll find there is key to understanding the scale of what you 
propose and what I'm going to say below.

Even with very small documents, an index with 2 billion of them is probably 
going to be at least 100GB, and quite possibly 300GB, 500GB, or larger.

For discussion purposes, let's say that you've got the extremely conservative 
index size of 100GB and you're going to put that on four servers.  To cache 
this index sufficiently to avoid performance problems, you'll need between 64GB 
and 128GB of total RAM for caching across the entire cluster.

If we assume that you've taken every possible step to reduce Solr's Java heap 
requirements, you might be able to do a heap of 8 to 16GB per server, but the 
actual heap requirement could be significantly higher.
Adding this up, you get a bare minimum memory requirement of 32GB for each of 
those four servers.  Ideally, you'd need to have 48GB for each of them.  If you 
plan to put it on two Solr servers instead of four, double the per-server 
memory requirement.

Remember that all the information in the previous paragraph assumes a total 
index size of 100GB, and your index has the potential to be a lot bigger than 
100GB.  If you have a 300GB index size instead of 100GB, triple those numbers.  
Scale up similarly for larger sizes.
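
The scaling rules above can be sketched as a small calculation.  This is only 
one way the figures add up, under assumptions that are not stated in the 
original: it takes a 16GB heap (the top of the 8 to 16GB range) plus 16GB to 
32GB of cache per server (64GB to 128GB spread over four servers) as the 
baseline, and scales everything linearly with index size and inversely with 
server count.

```python
# Baseline from the figures above: a 100GB index on four Solr servers.
BASE_INDEX_GB = 100
BASE_SERVERS = 4
BASE_HEAP_GB = 16          # assumption: upper end of the 8 to 16GB heap range
BASE_CACHE_GB = (16, 32)   # 64GB and 128GB of cluster-wide cache, split 4 ways


def per_server_memory_gb(index_gb, servers):
    """Return (bare minimum, ideal) RAM per server in GB, scaled linearly."""
    scale = (index_gb / BASE_INDEX_GB) * (BASE_SERVERS / servers)
    low = (BASE_HEAP_GB + BASE_CACHE_GB[0]) * scale
    high = (BASE_HEAP_GB + BASE_CACHE_GB[1]) * scale
    return low, high
```

For example, per_server_memory_gb(100, 4) gives 32GB and 48GB, 
per_server_memory_gb(100, 2) doubles that to 64GB and 96GB, and 
per_server_memory_gb(300, 4) triples it to 96GB and 144GB.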

One final note: your anticipated query volume is quite low, so you might be 
able to get away with a little bit less memory than I have described here, but 
you should be aware that running with less may cause query times measured in 
tens of seconds, and SolrCloud may become very unstable.

Thanks,
Shawn

