On 5/14/2013 11:00 PM, pankaj.pand...@wipro.com wrote:
> We have to setup a billion document index using Apache Solr(about 2 billion 
> docs). I need some assistance on choosing the right configuration for the 
> environment setup. I have been going through the Solr documentation, but 
> couldn't figure out what would be the best configuration for same.
> 
> Below is the configuration on one of the box we have, can you please assist 
> if this will suffice our requirement.
> OS: SunOS
> RAM : 32GB
> Processor: 4 (SPARC64-VII+/2660Mhz)
> Server: Sun SPARC Enterprise M5000
> 
> Desired System Requirements:
> Expected number of requests/day: 10,000
> New Documents ingested/day: 1Million
> 
> I have below questions on same:
> 
> 1.       Does the above system configuration seems ok for our requirement?
> 
> 2.       Is it ok if we host entire index on a single physical box or I 
> should use multiple physical box?
> 
> 3.       Should I go for the simple installation or SolrCloud?
> 
> 4.       If I should use SolrCloud, then probably I may have to use some 
> master/slave setup(not sure though)? What I have in mind is to use master for 
> ingestion of new documents and slave for querying. Then at the end of the day 
> I can have the replication to update the slaves? Can you please advise if 
> this is a good approach and most importantly if this is feasible?
> 
> On google, I could find that many people have already setup such environment, 
> but I couldn't figure out the configuration they are using. If some can share 
> their experience, then it will probably help others as well.

There's a lot of information here for you to digest.  Be sure to read
all the way to the end.  The numbers (and the costs associated with
those numbers) might scare you.

I have no idea what's going to be in your index.  I even looked up the
website for your email domain, and still don't know what you might be
trying to search.

Because it uses a 32-bit signed number to track internal document IDs,
a single Solr index (not sharded) is limited to a little more than 2
billion documents (roughly 2.147 billion).  This means you'll want to
use distributed search (sharding)
from the beginning.  For new deployments, SolrCloud is much better than
trying to handle sharding yourself.  You'll probably want 8 or more
shards and a minimum replication factor of 2, so that you have two
copies of every shard.  That doesn't necessarily mean that you'll need
that many machines, but you might want to plan on at least four of them.
You'll probably be putting more than one shard per server.
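
The layout arithmetic above can be sketched as follows.  This is only
an illustration: the variable names are made up, documents are assumed
to spread evenly across shards, and the exact usable per-index limit
may be slightly below the 32-bit maximum used here.

```python
# Rough layout arithmetic for the shard and replica counts suggested above.
INT32_MAX = 2**31 - 1          # upper bound from the 32-bit signed document ID

total_docs = 2_000_000_000     # from the original question
num_shards = 8                 # suggested minimum
replication_factor = 2         # two copies of every shard
num_servers = 4                # suggested minimum number of Solr machines

docs_per_shard = total_docs // num_shards          # assumes an even spread
total_cores = num_shards * replication_factor      # shard replicas to host
cores_per_server = total_cores // num_servers
headroom = INT32_MAX / docs_per_shard              # growth room per shard

print(f"{docs_per_shard:,} docs per shard")
print(f"{total_cores} shard replicas, {cores_per_server} per server")
print(f"each shard could grow about {headroom:.1f}x before hitting the limit")
```

With these numbers each server hosts several shard replicas, which is
why more than one shard per server is the expected outcome.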

SolrCloud is a true cluster - there is no master and no slaves.  Both
indexing and queries are completely distributed.  Clients that you write
in Java get these distributed features with no extra requirements.
Non-Java clients will require some form of external load balancing to
ensure that they are always talking to a server that's up.
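
For a non-Java client, the simplest stand-in for an external load
balancer is client-side failover: try each node in turn until one
answers.  The sketch below assumes hypothetical host names and a
collection called collection1, and it is plain failover, not real load
balancing.  A proper load balancer in front of the cluster is usually
the better choice.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical node URLs; replace with your own hosts and collection name.
SOLR_NODES = [
    "http://solr1.example.com:8983/solr/collection1",
    "http://solr2.example.com:8983/solr/collection1",
]


def http_get(url, timeout=5):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def query_any_node(params, nodes=SOLR_NODES, fetch=http_get):
    """Try each node in order and return the first successful JSON response."""
    query = urllib.parse.urlencode(params)
    last_error = None
    for base in nodes:
        try:
            return json.loads(fetch(f"{base}/select?{query}"))
        except (urllib.error.URLError, OSError, ValueError) as err:
            last_error = err   # node down or bad response; try the next one
    raise RuntimeError(f"no Solr node responded: {last_error}")
```

A call would look like query_any_node({"q": "*:*", "wt": "json"}).  The
fetch parameter exists only so the transport can be swapped out, for
example in tests.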

The absolute minimum number of physical machines you need for SolrCloud
is three.  Two of those need to be the beefy workhorses.  Each of them
will run Solr and a standalone ZooKeeper.  The third machine can be
modest and will just run a third instance of ZooKeeper.  If you have
more than two servers that will run Solr, then you can just run the
standalone ZooKeeper on three of them and won't need any extra hardware.
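
The reason three is the minimum comes from ZooKeeper needing a strict
majority of its ensemble to be up.  A small sketch of that arithmetic
(the function names are illustrative):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many ZooKeeper nodes can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 5):
    print(f"{n} node(s): quorum {quorum_size(n)}, "
          f"survives {tolerated_failures(n)} failure(s)")
```

Two ZooKeeper nodes tolerate no failures at all, so a two-node ensemble
is no more resilient than a single node.  Three is the smallest
ensemble that survives losing one machine.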

Memory is going to be your real problem with a very large index.  When
it comes to the amount of required memory, you might want to read this
wiki page, then come back here:

http://wiki.apache.org/solr/SolrPerformanceProblems

I really was serious about reading that page, and not just because I
wrote it.  The information you'll find there is key to understanding the
scale of what you propose and what I'm going to say below.

Even with very small documents, an index with 2 billion of them is
probably going to be at least 100GB, and quite possibly 300GB, 500GB, or
larger.

For discussion purposes, let's say that you've got the extremely
conservative index size of 100GB and you're going to put that on four
servers.  To cache this index sufficiently to avoid performance
problems, you'll need between 64GB and 128GB of total RAM for caching
across the entire cluster.

If we assume that you've taken every possible step to reduce Solr's Java
heap requirements, you might be able to do a heap of 8 to 16GB per
server, but the actual heap requirement could be significantly higher.
Adding this up, you get a bare minimum memory requirement of 32GB for
each of those four servers.  Ideally, you'd need to have 48GB for each
of them.  If you plan to put it on two Solr servers instead of four,
double the per-server memory requirement.

Remember that all the information in the previous paragraph assumes a
total index size of 100GB, and your index has the potential to be a lot
bigger than 100GB.  If you have a 300GB index size instead of 100GB,
triple those numbers.  Scale up similarly for larger sizes.
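
The scaling rule in the last two paragraphs can be written down as a
small calculator.  It simply applies the linear rule of thumb stated
above (32GB minimum and 48GB ideal per server for a 100GB index on four
servers).  It is not a measurement, and the names are illustrative.

```python
BASE_INDEX_GB = 100             # index size the baseline figures assume
BASE_SERVERS = 4                # server count the baseline figures assume
BASE_PER_SERVER_GB = (32, 48)   # (bare minimum, ideal) RAM per server


def per_server_ram_gb(index_gb, servers):
    """Scale the baseline linearly with index size and inversely with servers."""
    factor = (index_gb / BASE_INDEX_GB) * (BASE_SERVERS / servers)
    return tuple(gb * factor for gb in BASE_PER_SERVER_GB)


for index_gb, servers in [(100, 4), (100, 2), (300, 4)]:
    minimum, ideal = per_server_ram_gb(index_gb, servers)
    print(f"{index_gb}GB index on {servers} servers: "
          f"{minimum:.0f}GB minimum, {ideal:.0f}GB ideal per server")
```

Real requirements depend heavily on the schema, the queries, and the
heap settings, so treat the output as a starting point for planning
rather than a guarantee.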

One final note: your anticipated query volume is quite low, so you might
be able to get away with a little bit less memory than I have described
here, but you should be aware that running with less may cause query
times measured in tens of seconds, and SolrCloud may become very unstable.
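
To put the stated load in perspective, here are the average rates,
assuming (unrealistically) that traffic is spread evenly over 24 hours.
Real traffic will have peaks well above these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60

queries_per_day = 10_000        # from the original question
new_docs_per_day = 1_000_000    # from the original question

avg_qps = queries_per_day / SECONDS_PER_DAY
avg_docs_per_sec = new_docs_per_day / SECONDS_PER_DAY

print(f"average query rate: {avg_qps:.2f} queries/second")
print(f"average indexing rate: {avg_docs_per_sec:.1f} documents/second")
```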

Thanks,
Shawn
