Although technically it may be possible to put 1 billion documents in a
single Solr/Lucene index (the hard limit is roughly 2.1 billion documents
per Lucene index), I would recommend simply:
Don't do it! Don't try to put more than 250 million documents on a single
Solr node. In fact, 100 million is a better, more realistic limit.
To be clear, it does depend on your schema, your actual data, and your query
complexity (e.g., faceting, highlighting, sorting), not to mention your
query latency requirements. Query load will then determine your
replication needs.
Sure, if you are an uber-guru you can probably make it work, but if you are
sane, go with a sharded cluster such as SolrCloud.
In any case, always start with a proof of concept. Load up a test system
(cluster) with representative dummy data and see how it performs. Trying to
predict hardware needs in advance for a Solr deployment is just a very bad
idea. Proof of concept first! That provides the best "prediction".
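As a rough illustration of what such a proof of concept could measure, here is a minimal Python sketch that generates dummy documents and times a batch of queries. Everything named in it is an assumption for illustration: the field names `title_s` and `body_t`, the helper names, and the `run_query` callable, which stands in for whatever code actually sends a query to your test cluster. Random words are only a placeholder; a real test should use data that resembles your production documents.

```python
import math
import random
import string
import time


def make_dummy_docs(count, words_per_doc=20, seed=0):
    """Yield `count` dummy documents as dicts (field names are hypothetical)."""
    rng = random.Random(seed)
    for i in range(count):
        words = [
            "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))
            for _ in range(words_per_doc)
        ]
        yield {"id": str(i), "title_s": words[0], "body_t": " ".join(words)}


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def time_queries(run_query, queries):
    """Call run_query(q) for each query and summarize wall-clock latency in seconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append(time.perf_counter() - start)
    if not latencies:
        raise ValueError("no queries were supplied")
    latencies.sort()
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": latencies[-1],
    }


if __name__ == "__main__":
    docs = list(make_dummy_docs(1000))
    # Placeholder "query": a linear scan over the dummy docs. In a real test,
    # replace this with a function that queries the test cluster.
    stats = time_queries(
        lambda q: [d for d in docs if q in d["body_t"]],
        ["abc", "xyz", "solr"],
    )
    print(stats)
```

Comparing the p95 figure against your latency requirement at several index sizes gives a measured basis for the documents-per-node target.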
In summary, I recommend 100 to 250 million as a best target for documents
per node for a proof of concept, and then your actual testing will confirm
whether the actual target should be higher or lower. In some cases, with
complex queries and low latency requirements, 50 or even 10 million might be
a more realistic target, while in other cases with really simple data 400 or
500 million documents might actually work.
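To make the arithmetic concrete, here is a small Python sketch that turns a documents-per-node target into a shard count. The corpus size (about 2 billion documents) and the ingest rate (1 million new documents per day) are taken from the original question below; the function names and the one-year horizon are assumptions for illustration only. The result is a starting point for a proof of concept, not a prediction.

```python
def shards_needed(total_docs, docs_per_shard):
    """Smallest shard count that keeps every shard at or below docs_per_shard."""
    if total_docs < 0 or docs_per_shard <= 0:
        raise ValueError("total_docs must be >= 0 and docs_per_shard must be > 0")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-total_docs // docs_per_shard)


def corpus_after(initial_docs, docs_per_day, days):
    """Corpus size after `days` of steady ingestion with no deletions."""
    return initial_docs + docs_per_day * days


def total_cores(shards, replication_factor):
    """Total shard replicas to host when each shard has replication_factor copies."""
    return shards * replication_factor


if __name__ == "__main__":
    initial = 2_000_000_000   # from the original question
    per_day = 1_000_000       # from the original question
    in_a_year = corpus_after(initial, per_day, 365)  # horizon is an assumption

    for target in (10_000_000, 50_000_000, 100_000_000, 250_000_000, 500_000_000):
        now = shards_needed(initial, target)
        later = shards_needed(in_a_year, target)
        print(f"{target:>11,} docs/shard: {now:>3} shards now, {later:>3} after a year")
```

At 100 million documents per shard the current corpus needs 20 shards, and at 250 million it needs 8. Query load then decides how many replicas of each shard to run.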
-- Jack Krupansky
-----Original Message-----
From: pankaj.pand...@wipro.com
Sent: Wednesday, May 15, 2013 1:00 AM
To: solr-user@lucene.apache.org
Subject: Billion document index
Hi,
We have to set up a billion-document index using Apache Solr (about 2 billion
docs). I need some assistance choosing the right configuration for the
environment. I have been going through the Solr documentation, but
couldn't figure out what the best configuration would be.
Below is the configuration of one of the boxes we have. Can you please advise
whether it will suffice for our requirements?
OS: SunOS
RAM: 32GB
Processor: 4 (SPARC64-VII+/2660MHz)
Server: Sun SPARC Enterprise M5000
Desired System Requirements:
Expected number of requests/day: 10,000
New documents ingested/day: 1 million
I have the following questions:
1. Does the above system configuration seem OK for our requirements?
2. Is it OK to host the entire index on a single physical box, or should I
use multiple physical boxes?
3. Should I go for the simple installation or SolrCloud?
4. If I should use SolrCloud, then I may have to use some
master/slave setup (not sure, though)? What I have in mind is to use the master
for ingestion of new documents and the slaves for querying. Then at the end of
the day I can run replication to update the slaves. Can you please
advise whether this is a good approach and, most importantly, whether it is feasible?
On Google, I could find that many people have already set up such
environments, but I couldn't figure out the configuration they are using. If
someone can share their experience, it will probably help others as well.
Thanks!
Regards,
Pankaj