Thanks, Shawn, for explaining everything in such detail; it was really helpful.
I have a few more queries on the same. Can you please explain the purpose of the
3rd box in the minimal configuration, with the standalone ZooKeeper?
On a separate note, if we go ahead with 4 boxes (8 shards with a replication
factor of 2 for each):
1. Would it be OK to keep the replica on the same box, or would we
need a separate box for that?
2. Is the above configuration sufficient to guarantee failover
and high availability?
3. How can I configure my application to always query against the
replica and let the master be used only for ingestion? The replica will be
synced with the master after working hours (overnight).
Regards,
Pankaj
-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday, May 15, 2013 12:01 PM
To: solr-user@lucene.apache.org
Subject: Billion document index
On 5/14/2013 11:00 PM, pankaj.pand...@wipro.com wrote:
> We have to set up a billion-document index using Apache Solr (about 2 billion
> docs). I need some assistance in choosing the right configuration for the
> environment setup. I have been going through the Solr documentation, but
> couldn't figure out what would be the best configuration for the same.
>
> Below is the configuration of one of the boxes we have. Can you please advise
> whether this will meet our requirement?
> OS: SunOS
> RAM : 32GB
> Processor: 4 (SPARC64-VII+/2660Mhz)
> Server: Sun SPARC Enterprise M5000
>
> Desired System Requirements:
> Expected number of requests/day: 10,000
> New documents ingested/day: 1 million
>
> I have the questions below on the same:
>
> 1. Does the above system configuration seem OK for our requirement?
>
> 2. Is it OK if we host the entire index on a single physical box, or
> should I use multiple physical boxes?
>
> 3. Should I go for the simple installation or SolrCloud?
>
> 4. If I should use SolrCloud, then I may have to use some
> master/slave setup (not sure though)? What I have in mind is to use the master for
> ingestion of new documents and the slave for querying. Then at the end of the day
> I can have replication update the slaves. Can you please advise if
> this is a good approach and, most importantly, if it is feasible?
>
> On Google, I could find that many people have already set up such an environment,
> but I couldn't figure out the configuration they are using. If someone can share
> their experience, then it will probably help others as well.
There's a lot of information here for you to digest. Be sure to read all the
way to the end. The numbers (and the costs associated with those numbers)
might scare you.
I have no idea what's going to be in your index. I even looked up the website
for your email domain, and still don't know what you might be trying to search.
Because it uses a 32-bit signed number to track things, a single Solr index
(not sharded) is limited to a little more than 2 billion documents. This means
you'll want to use distributed search (sharding) from the beginning. For new
deployments, SolrCloud is much better than trying to handle sharding yourself.
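
As a quick sanity check on that limit, here is a minimal sketch of the arithmetic. It assumes the "32-bit signed number" is an ordinary two's-complement integer; the 2 billion figure comes from the original question, the 8 shards are the suggestion from this reply, and the variable names are only illustrative.

```python
# Largest value a 32-bit signed (two's-complement) integer can hold.
MAX_SIGNED_32 = 2**31 - 1

# Document count from the original question ("about 2 billion docs").
planned_docs = 2_000_000_000

# Room left before a single, unsharded index would hit the ceiling.
headroom = MAX_SIGNED_32 - planned_docs

# Documents per shard if the index is split evenly across 8 shards
# (8 is the suggested starting point, not a fixed requirement).
docs_per_shard = planned_docs // 8

print(MAX_SIGNED_32)
print(headroom)
print(docs_per_shard)
```

With an even split, each shard holds 250 million documents, which is far below the per-index ceiling.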
You'll probably want 8 or more shards and a minimum replication factor of 2, so
that you have two copies of every shard. That doesn't necessarily mean that
you'll need that many machines, but you might want to plan on at least four of
them.
You'll probably be putting more than one shard per server.
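
To illustrate what "more than one shard per server" could look like, here is a rough sketch that spreads 8 shards with a replication factor of 2 over 4 machines so that the two copies of a shard never share a machine. The function `place_replicas` and its round-robin rule are hypothetical, made up for this illustration; this is not a description of how SolrCloud itself assigns replicas.

```python
def place_replicas(num_shards, replication_factor, num_servers):
    """Hypothetical round-robin layout: returns {server: [(shard, replica), ...]}.

    Copies of one shard land on consecutive servers, so they are on
    different machines as long as replication_factor <= num_servers.
    """
    if replication_factor > num_servers:
        raise ValueError("need at least as many servers as copies of each shard")
    layout = {server: [] for server in range(num_servers)}
    for shard in range(num_shards):
        for replica in range(replication_factor):
            server = (shard * replication_factor + replica) % num_servers
            layout[server].append((shard, replica))
    return layout


# The numbers discussed in this thread: 8 shards, 2 copies each, 4 machines.
layout = place_replicas(num_shards=8, replication_factor=2, num_servers=4)
for server, cores in layout.items():
    print(server, [shard for shard, _ in cores])
```

Each machine ends up hosting four shard copies, and with this layout losing any single machine still leaves one copy of every shard on the remaining three.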
SolrCloud is a true cluster - there is no master and no slaves. Both indexing
and queries are completely distributed. Clients that you write in Java get
these distributed features with no extra requirements; non-Java clients will
require some form of external load balancing to ensure that they are always
talking to a server that's up.
The absolute minimum number of physical machines you need for SolrCloud is
three. Two of those need to be the beefy workhorses. Each of them will run
Solr and a standalone ZooKeeper. The third machine can be modest and will just
run a third instance of ZooKeeper. If you have more than two servers that will
run Solr, then you can just run the standalone ZooKeeper on three of them and
won't need any extra hardware.
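
The reason three ZooKeeper instances is the minimum comes down to majority quorum: a ZooKeeper ensemble stays available only while more than half of its members are up. A minimal sketch of that arithmetic follows; the helper names `zk_quorum` and `tolerated_failures` are made up for illustration and are not part of any ZooKeeper API.

```python
def zk_quorum(ensemble_size):
    """Smallest number of members that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many members can be down while a majority is still up."""
    return ensemble_size - zk_quorum(ensemble_size)


for n in (1, 2, 3, 5):
    print(n, zk_quorum(n), tolerated_failures(n))
```

With two instances, both must be up, so a second instance adds no fault tolerance over one. Three is the smallest ensemble that survives the loss of a single machine, which is the job of the modest third box.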
Memory is going to be your real problem with a very large index. When it comes
to the amount of required memory, you might want to read this wiki page, then
come back here:
http://wiki.apache.org/solr/SolrPerformanceProblems
I really was serious about reading that page, and not just because I wrote it.
The information you'll find there is key to understanding the scale of what you
propose and what I'm going to say below.
Even with very small documents, an index with 2 billion of them is probably
going to be at least 100GB, and quite possibly 300GB, 500GB, or larger.
For discussion purposes, let's say that you've got the extremely conservative
index size of 100GB and you're going to put that on four servers. To cache
this in