Hi Rekha,
In addition to what Shawn explained, the answer to your last question is yes 
and no: you can split existing shards, but you cannot otherwise change the 
number of shards without reindexing. You can also add nodes, but you should 
make sure that adding nodes results in a well-balanced cluster.
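For reference, here is a minimal SolrJ sketch of both operations (the 
collection name, shard names, and Solr URL are just placeholders, not values 
from your setup):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class ScaleOutSketch {
  public static void main(String[] args) throws Exception {
    // Any Solr node can receive Collections API requests; the URL is a placeholder.
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {

      // SPLITSHARD: split an existing shard into two sub-shards.
      // The parent shard keeps serving until the split finishes, then goes inactive.
      CollectionAdminRequest.splitShard("mycollection")
          .setShardName("shard1")
          .process(client);

      // ADDREPLICA: add another replica of a shard, e.g. after adding a new node.
      CollectionAdminRequest.addReplicaToShard("mycollection", "shard2")
          .process(client);
    }
  }
}

The same operations are available as plain HTTP requests to /admin/collections 
if you prefer curl over SolrJ.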
You can also address scalability differently. Depending on your case, you 
might not need a single index with 200 billion documents. For example, if you 
have a multi-tenant system and each tenant searches only its own data, each 
tenant or group of tenants can have a separate index or even a separate 
cluster. Also, if you only append data and often filter by time, you can use 
time-based indices.
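Here is a rough SolrJ sketch of the time-based variant (collection names, 
config name, and shard/replica counts are arbitrary examples, not a 
recommendation):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class TimeBasedIndicesSketch {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {

      // One collection per month; pick shard/replica counts from your own tests.
      CollectionAdminRequest.createCollection("logs_2018_09", "logs_config", 8, 2)
          .process(client);
      CollectionAdminRequest.createCollection("logs_2018_10", "logs_config", 8, 2)
          .process(client);

      // One alias so queries can span all monthly collections, while new
      // documents are indexed only into the current month's collection.
      CollectionAdminRequest.createAlias("logs", "logs_2018_09,logs_2018_10")
          .process(client);
    }
  }
}

A nice side effect is retention: dropping a month of data means deleting one 
collection instead of deleting billions of individual documents.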

Here is a blog post explaining how to run tests to estimate shard/cluster size:
http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html 

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 2 Oct 2018, at 22:41, Shawn Heisey <apa...@elyograg.org> wrote:
> 
> On 10/2/2018 9:33 AM, Rekha wrote:
>> Dear Solr Team, I need the following clarifications from you; please check 
>> and advise me:
>> 1. I want to store and search 200 billion documents (each document contains 
>> 16 fields). For my case, can I achieve this using SolrCloud?
>> 2. For my case, how many shards and nodes will be needed?
>> 3. In the future, can I increase the number of nodes and shards?
>> Thanks,
>> Rekha Karthick
> 
> In a nutshell:  It's not possible to give generic advice. The contents of the 
> fields will affect exactly what you need.  The nature of the queries that you 
> send will affect exactly what you need.  The query rate will affect exactly 
> what you need. The overall size of the index (disk space, as well as document 
> count) will affect what you need.
> 
> In the "not very helpful" department, but I promise this is absolute truth, 
> there's this blog post:
> 
> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> 
> To handle 200 billion documents *in a single collection*, you're probably 
> going to want at least 200 shards, and there are good reasons to go with even 
> more shards than that.  But you need to be warned that there can be serious 
> scalability problems when SolrCloud must keep track of that many different 
> indexes.  Here's an issue I filed for scalability problems with thousands of 
> collections ... there can be similar problems with lots of shards as well.  
> This issue says it is fixed, but no code changes that I am aware of were ever 
> made related to the issue, and as far as I can tell, it's still a problem 
> even in the latest version:
> 
> https://issues.apache.org/jira/browse/SOLR-7191
> 
> That many shards/replicas in one collection is likely to require ZooKeeper's 
> maximum znode size (jute.maxbuffer) to be increased, because it will probably 
> take more than one megabyte to hold the JSON structure describing the collection.
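To put a number on that: the default jute.maxbuffer is roughly 1 MB, and you 
can check how big a collection's state really is directly in ZooKeeper. A 
quick sketch (the ZooKeeper address and collection name are placeholders, and 
it assumes the default per-collection state.json layout):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class StateSizeCheck {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    try {
      connected.await();
      // This znode holds the collection's shard/replica layout and is what
      // has to fit under jute.maxbuffer.
      byte[] state = zk.getData("/collections/mycollection/state.json", false, null);
      System.out.println("state.json is " + state.length + " bytes");
    } finally {
      zk.close();
    }
  }
}

If that znode gets anywhere near the limit, jute.maxbuffer has to be raised 
consistently, as a JVM system property, on both the ZooKeeper servers and the 
Solr nodes.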
> 
> As for how many machines you'll need ... absolutely no idea.  If the query 
> rate is going to be insanely high, you'll want a dedicated machine for each 
> shard replica, and you may need many replicas, which is going to mean hundreds, 
> possibly thousands, of servers.  If the query rate is really low and/or each 
> document is very small, you might be able to house more than one shard per 
> server.  But you should know that handling 200 billion documents is going to 
> require a lot of hardware even if it turns out that you're not going to be 
> handling tons of data (per document) or queries.
> 
> Thanks,
> Shawn
> 
