Re: [scottchu] What kind of configuration to use for this size of news data?

scott.chu Wed, 11 May 2016 02:35:04 -0700

Hi, Charlie,

    Thanks for your concrete answer. I have some further questions, written 
in blue below.

scott.chu, scott....@udngroup.com
2016/5/11 (Wed)
----- Original Message ----- 
From: Charlie Hull 
To: solr-user@lucene.apache.org 
CC: 
Date: 2016/5/11 (Wed) 16:21
Subject: Re: [scottchu] What kind of configuration to use for this size of news 
data?

On 11/05/2016 04:27, scott.chu wrote: 
> Fix some typos, add some words and resend same question => 
> 
> I want to build a Solr engine for over 60 years of news articles. My 
> requirements are (I use Solr 5.4.1): 

Hi Scott, 

We've actually done something very similar for our client NLA Media 
Access in the UK, who handle licensing of most UK newspaper content. 
They have over 45m docs going back to 2006. 
> 
> 1> Currently over 10M docs. 2> Currently over 60GB total data 
> size. 3> The number of docs and the data size will keep growing at a 
> rate of 1,000 docs (or 8MB) per day. 4> There are 5-6 different 
> newspaper types in total. 
> 
> My questions are: 1> Is it workable just to use the master-slave 
> model, or should I turn to SolrCloud? (I ask because our system 
> management group has never managed a distributed system before, and 
> they have no knowledge of ZooKeeper, shards, etc. They also don't know 
> how to back up/restore distributed data.) 

Workable yes, advisable no. You should get much better reliability & 
performance with SolrCloud once it's set up. Also, if you have 
replication set up correctly the need for backup/restore will be 
significantly reduced and may be unnecessary. 

We used master-slave for News UK's Solr setup (articles from The Times 
and other papers) but this was before SolrCloud had properly arrived. 
We'd only use master-slave rarely now. 

If I use SolrCloud, I know I have to set up ZooKeeper. I know there is something 
called a 'quorum' or 'ensemble' in ZooKeeper terminology. My understanding is 
also that (2n+1) ZooKeeper nodes are needed per n SolrCloud nodes. In your case, 
are you running one SolrCloud node per machine (whether physical or virtual)? 
In your experience, how many nodes, counting both SolrCloud and ZooKeeper nodes, 
do I need to set up? Is replication in SolrCloud as easy to set up as in older 
versions? (I set up replication in solrconfig.xml and use a solrcore.properties 
file to set/switch the role of each Solr node, rather than defining the role 
directly in solrconfig.xml.)
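
A note on the (2n+1) figure above: ZooKeeper's sizing rule is normally stated in terms of tolerated failures, not the number of Solr nodes. An ensemble stays available as long as a strict majority of its servers is up, so 2n+1 servers tolerate n failures. A minimal Python sketch of that arithmetic (the function names are made up for illustration and are not part of any ZooKeeper or Solr API):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare a few ensemble sizes: size, quorum, tolerated failures.
for size in (1, 3, 4, 5):
    print(size, quorum_size(size), tolerated_failures(size))
```

A 4-server ensemble tolerates no more failures than a 3-server one, which is why odd sizes such as 3 or 5 are the usual choice.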

> 2> Say I choose SolrCloud anyway. I would like each shard to hold 
> one specific year of data. Can it be done? 

Yes it can, but it may not be a good idea. If a large proportion of your 
queries hit recent news you may find one shard dealing with more queries 
than the others and becoming overloaded. Here's a blog post we wrote a 
long time ago about this - ignore the name Xapian, this applies to Solr 
as well: 
http://www.flax.co.uk/blog/2009/04/25/distributed-search-and-partition-functions/

> What configuration should I use? (AFAIK, SolrCloud distributes data 
> based on some intrinsic routing algorithm.) 

You can choose how to route data at indexing time: 
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
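
As a rough illustration of routing at indexing time, the sketch below builds a Collections API CREATE request that uses the "implicit" router with named shards, one per year. The host, collection name, shard names and the `pub_year` field are all hypothetical; with this router, each document's routing field is expected to carry the name of its target shard. Nothing is sent to a server here, the code only assembles the URL.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, shard_names, router_field):
    """Build a Collections API CREATE URL using the implicit router."""
    params = {
        "action": "CREATE",
        "name": name,
        "router.name": "implicit",        # documents go to a named shard
        "shards": ",".join(shard_names),  # explicit shard names
        "router.field": router_field,     # doc field holding the shard name
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical values, for illustration only.
url = create_collection_url(
    "http://localhost:8983/solr",
    "news",
    ["y2014", "y2015", "y2016"],
    "pub_year",
)
print(url)
```

Per-year shards carry the load-imbalance risk described above, so this shows how it could be done rather than recommending it.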

You are right. I overlooked this issue. I'll think it over and may drop 
my idea. Thanks for sharing the blog article; I'll take a good look at it.

> 3> Suppose I want to create another Solr engine with only 
> one or two particular paper types. Is it possible to copy their index 
> data directly from the big central Solr engine, or do I have to rebuild 
> the index from the raw article data? (Our business may well need 
> this.) 

Yes, I guess so, but why copy it when you could just search it with a 
filter for the paper types? 
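
For illustration, a filtered search of this kind could be expressed with Solr's `fq` (filter query) parameter. The sketch below only builds the request URL; the host, collection name and the `paper_type` field are assumptions, not taken from any real schema.

```python
from urllib.parse import urlencode


def filtered_search_url(base_url, collection, text, paper_types):
    """Build a /select URL that restricts results to the given paper types."""
    type_clause = " OR ".join(f'"{t}"' for t in paper_types)
    params = [
        ("q", text),                            # the user's query
        ("fq", f"paper_type:({type_clause})"),  # restrict by paper type
        ("wt", "json"),
    ]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical values, for illustration only.
search_url = filtered_search_url(
    "http://localhost:8983/solr", "news", "election", ["A", "B"]
)
print(search_url)
```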

We have a special business case called the 'buyout newspaper search service'. 
Customers buy an intranet license to use the search service for articles of 
certain newspaper types and certain ranges of publication dates, e.g. paper 
type 'A' for 2010-2012 and paper type 'B' for 2015. The buyout means we have to 
install the whole search service at the customer's site, and the customer can 
only use it within their enterprise intranet. So I have to build a dedicated 
Solr server for each such customer. Your filtering idea is very much like 
Elasticsearch's multitenancy; neither fits our buyout business model. 
Do you have any suggestions for building a Solr server under these conditions?

> 
> I'd like to hear any good suggestions and experiences. 
> 
> Thanks in advance and best regards. 
> 
> Scott Chu @ 2016/5/11 11:26 GMT+8 
> 

Hope this helps! 

Cheers 

Charlie 

-- 
Charlie Hull 
Flax - Open Source Enterprise Search 

tel/fax: +44 (0)8700 118334 
mobile: +44 (0)7767 825828 
web: www.flax.co.uk 

