Although it may technically be possible to put 1 billion documents in a single Solr/Lucene index (the hard limit is about 2 billion documents per index), my recommendation is simple: Don't do it! Don't try to put more than 250 million documents on a single Solr node. In fact, 100 million is a better, more realistic limit.

To be clear, it does depend on your schema, your actual data, and your query complexity (e.g., faceting, highlighting, sorting), not to mention your query latency requirements. Query load will then determine your replication needs.

Sure, if you are an uber-guru you can probably make it work, but if you are sane, go with a sharded cluster such as SolrCloud.

In any case, always start with a proof of concept. Load up a test system (cluster) with representative dummy data and see how it performs. Trying to predict hardware needs in advance for a Solr deployment is just a very bad idea. Proof of concept first! That provides the best "prediction".
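
As a rough illustration of the "dummy data" step, here is a minimal Python sketch that generates synthetic documents in batches and serializes each batch as JSON. The field names (id, title, body, category), the word list, and the batch sizes are all made up for the example; substitute the fields and value distributions of your real schema, since random words will not index or query like real text. How you post the batches to your test cluster depends on your setup and is not shown.

```python
import json
import random

# Tiny made-up vocabulary; real data will have a very different term distribution.
WORDS = ["solr", "lucene", "index", "shard", "query", "facet", "replica", "node"]


def make_doc(doc_id, rng):
    """Build one synthetic document. Field names are hypothetical."""
    return {
        "id": str(doc_id),
        "title": " ".join(rng.choices(WORDS, k=5)),
        "body": " ".join(rng.choices(WORDS, k=200)),
        "category": rng.choice(["a", "b", "c"]),
    }


def batches(total, batch_size, seed=42):
    """Yield lists of documents, batch_size at a time, until total is reached."""
    rng = random.Random(seed)  # fixed seed so test runs are repeatable
    for start in range(0, total, batch_size):
        end = min(start + batch_size, total)
        yield [make_doc(i, rng) for i in range(start, end)]


if __name__ == "__main__":
    for n, batch in enumerate(batches(total=2500, batch_size=1000)):
        payload = json.dumps(batch)  # one JSON array of documents per batch
        print(f"batch {n}: {len(batch)} docs, {len(payload)} bytes")
```

Scale `total` up until the test index is a meaningful fraction of your per-node target, then run your real queries against it.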

In summary, I recommend 100 to 250 million as a best target for documents per node for a proof of concept, and then your actual testing will confirm whether the actual target should be higher or lower. In some cases, with complex queries and low latency requirements, 50 or even 10 million might be a more realistic target, while in other cases with really simple data 400 or 500 million documents might actually work.
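
To put those per-node targets next to the numbers in the original question (about 2 billion documents, 1 million new documents per day), here is a small back-of-the-envelope Python sketch. It only does the division; it assumes documents are spread evenly across shards and says nothing about replicas, RAM, or latency, which only the proof of concept can tell you. The function names are just for illustration.

```python
def shards_needed(total_docs, docs_per_node):
    """Smallest number of shards so that no shard exceeds docs_per_node."""
    return -(-total_docs // docs_per_node)  # integer ceiling division


def days_to_fill_one_shard(docs_per_node, new_docs_per_day):
    """How many days of ingestion add up to one more shard's worth of documents."""
    return docs_per_node // new_docs_per_day


TOTAL_DOCS = 2_000_000_000    # "about 2 billion docs" from the question
NEW_DOCS_PER_DAY = 1_000_000  # "1 million" ingested per day from the question

for target in (50_000_000, 100_000_000, 250_000_000):
    shards = shards_needed(TOTAL_DOCS, target)
    days = days_to_fill_one_shard(target, NEW_DOCS_PER_DAY)
    print(f"{target:>11,} docs/node -> {shards:>2} shards, "
          f"one more shard's worth of growth every {days} days")
```

At 100 million documents per node that is 20 shards, and at 250 million it is 8, before counting any replicas for query load.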

-- Jack Krupansky

-----Original Message-----
From: pankaj.pand...@wipro.com
Sent: Wednesday, May 15, 2013 1:00 AM
To: solr-user@lucene.apache.org
Subject: Billion document index

Hi,

We have to set up a billion-document index using Apache Solr (about 2 billion docs). I need some assistance in choosing the right configuration for the environment. I have been going through the Solr documentation, but couldn't figure out what the best configuration would be.

Below is the configuration of one of the boxes we have. Can you please advise whether it will be sufficient for our requirements?
OS: SunOS
RAM: 32GB
Processor: 4 (SPARC64-VII+/2660Mhz)
Server: Sun SPARC Enterprise M5000

Desired System Requirements:
Expected number of requests/day: 10,000
New documents ingested/day: 1 million

I have the following questions:

1. Does the above system configuration seem OK for our requirements?

2. Is it OK to host the entire index on a single physical box, or should I use multiple physical boxes?

3. Should I go for the simple installation or SolrCloud?

4. If I should use SolrCloud, then I may have to use some kind of master/slave setup (not sure though)? What I have in mind is to use the master for ingestion of new documents and the slaves for querying. Then at the end of the day I can run replication to update the slaves. Can you please advise whether this is a good approach and, most importantly, whether it is feasible?

On Google, I could find that many people have already set up such an environment, but I couldn't figure out the configuration they are using. If someone can share their experience, it will probably help others as well.

Thanks!

Regards,
Pankaj

