Emir...

Thanks for the input. Our larger collections are localized content, so it may
make sense to shard those and target the specific shard per language. I'll
need to confirm how the content is used, that is, whether queries are always
within a single language or cross-language.
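
If they do turn out to be per-language queries, one option I'm looking at is
the implicit router, which would let us name a shard per language and target
it at query time. A rough sketch of what I have in mind (collection, field,
and shard names here are just placeholders):

    # One named shard per language; 3 shards x 3 replicas = 9 cores
    # on a 3-node cluster, so allow 3 per node
    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=docs_localized&router.name=implicit&shards=en,fr,de&replicationFactor=3&maxShardsPerNode=3'

    # Send each document to its language shard via the _route_ parameter
    curl 'http://localhost:8983/solr/docs_localized/update?commit=true&_route_=en' \
         -H 'Content-Type: application/json' \
         -d '[{"id":"doc-en-1","title_txt":"hello"}]'

    # Query only the English shard
    curl 'http://localhost:8983/solr/docs_localized/select?q=title_txt:hello&_route_=en'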

Thanks also for the link .. very helpful!

All the best,
...scott



On 3/14/18 2:21 AM, Emir Arnautović wrote:
Hi Scott,
There is no definite answer - it depends on your documents and query patterns.
Sharding does come with overhead, but it also allows Solr to parallelise a
search. Query latency is usually what tells you whether you need to split a
collection into multiple shards. If you are ok with the latency, there is no
need to split. Another scenario where shards make sense is when routing is
used in the majority of queries, since that enables you to query only a
subset of documents.
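
For illustration, with the default compositeId router that looks something
like this (collection name and IDs are made up):

    # Prefix the document ID with a routing key; Solr hashes the prefix,
    # so all "en!" documents land on the same shard
    curl 'http://localhost:8983/solr/mycoll/update?commit=true' \
         -H 'Content-Type: application/json' \
         -d '[{"id":"en!doc1","title_txt":"hello"}]'

    # _route_ restricts the query to the shard(s) holding that key
    curl 'http://localhost:8983/solr/mycoll/select?q=title_txt:hello&_route_=en!'
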
There is also an indexing aspect where sharding helps: if high indexing
throughput is needed, having multiple shards will spread the indexing load
across multiple servers.
It seems to me that you have no high indexing throughput requirement, so the
main criterion should be query latency.
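
A quick way to watch latency is the QTime value (in milliseconds) that Solr
returns in every response header, e.g.:

    # rows=0 still executes the full search, so QTime is representative
    curl 'http://localhost:8983/solr/mycoll/select?q=typical+user+query&rows=0&wt=json'
    # => {"responseHeader":{"status":0,"QTime":42,...},...}

Note that QTime covers only Solr's search time, not network or response
serialization.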
Here is another blog post talking about this subject: 
http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html 

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



On 14 Mar 2018, at 01:01, Scott Prentice <s...@leximation.com> wrote:

We're in the process of moving from 12 single-core collections (non-cloud Solr) 
on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging in size 
from 50K to 150K documents with one at 1.2M docs. Our max query frequency is 
rather low .. probably no more than 10-20/min. We do update frequently, maybe 
10-100 documents every 10 mins.

Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got each 
collection split into 2 shards with 3 replicas (one per VM). Also, Zookeeper is 
running on each VM. I understand that it's best to have each ZK server on a 
separate machine, but we're hoping this will work for now.
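
In case it helps anyone following along, I believe the Collections API call
for a layout like ours would be roughly the following (collection name is a
placeholder). With 2 shards x 3 replicas = 6 cores on 3 nodes, you'd need to
allow 2 per node:

    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=3&maxShardsPerNode=2'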

This all seemed like a good place to start, but after reading lots of articles 
and posts, I'm thinking that maybe our smaller collections (under 100K docs) 
should just be one shard each, and maybe the 1.2M collection should be more 
like 6 shards. How do you decide on the right number of shards?
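
From what I've read, we also wouldn't have to get this exactly right up
front, since SPLITSHARD can divide an existing shard in place, something like
this (collection name is a placeholder):

    # Splits shard1 into shard1_0 and shard1_1 on the same nodes; the
    # parent shard is left inactive and can be deleted afterward
    curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=bigcoll&shard=shard1'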

Also, our current live system is separated into dev/stage/prod tiers, but now
all of these tiers are together on each of the cloud VMs. This bothers some
people,
thinking that it may make our production environment less stable. I know that 
in an ideal world, we'd have them all on separate systems, but with the 
replication, it seems like we're going to make the overall system more stable. 
Is this a correct understanding?
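
One thing we're considering to keep the tiers from stepping on each other is
giving each tier its own ZooKeeper chroot, so they share the ensemble but not
any cluster state, e.g. (hostnames are placeholders):

    # The chroot path must exist in ZK first, e.g.:
    #   bin/solr zk mkroot /solr-dev -z zk1:2181
    bin/solr start -cloud -z 'zk1:2181,zk2:2181,zk3:2181/solr-dev'
    bin/solr start -cloud -z 'zk1:2181,zk2:2181,zk3:2181/solr-prod'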

I'm just wondering if anyone has opinions on whether we're going in a reasonable
direction or not. Are there any articles that discuss these initial 
sizing/scoping issues?

Thanks!
...scott



