Re: [scottchu] What kind of configuration to use for this size of news data?

Charlie Hull Wed, 11 May 2016 05:06:08 -0700

On 11/05/2016 10:55, scott.chu wrote:


I just found that the mailing list does not seem to accept colored
fonts (I received my own message from the list and the blue text was
gone!). I am using rows of asterisks to highlight my questions and
sending this again.


Answers inline below.

C




----- Original Message ----- From: scott (self) To: solr-user To: Date:
2016/5/11 (Wed) 17:34 Subject: Re: [scottchu] What kind of
configuration to use for this size of news data?


Hi, Charlie,

Thanks for your concrete answer. I have further questions, written
in blue below.

scott.chu, scott....@udngroup.com 2016/5/11 (Wed) ----- Original
Message ----- From: Charlie Hull To: solr-user@lucene.apache.org CC:
Date: 2016/5/11 (Wed) 16:21 Subject: Re: [scottchu] What kind of
configuration to use for this size of news data?


On 11/05/2016 04:27, scott.chu wrote:

Fix some typos, add some words and resend same question =>

I want to build a Solr engine for over 60 years of news articles. My
requirements are (I use Solr 5.4.1):


Hi Scott,

We've actually done something very similar for our client NLA
Media Access in the UK, who handle licensing of most UK newspaper
content. They have over 45m docs going back to 2006.


1> Currently over 10M docs.
2> Currently over 60GB total data size.
3> The number of docs and the data size will keep growing at a rate
   of about 1000 docs (or 8MB) per day.
4> There are 5-6 different newspaper types in total.

My questions are: 1> Is it workable enough just to use a master-slave
model? Or should I turn to SolrCloud? (I ask this because our
system management group has never managed a distributed system before
and has no knowledge of Zookeeper, shards, etc. They also
don't know how to back up/restore distributed data.)


Workable yes, advisable no. You should get much better reliability &
performance with SolrCloud once it's set up. Also, if you have
replication set up correctly the need for backup/restore will be
significantly reduced and may be unnecessary.

We used master-slave for News UK's Solr setup (articles from The
Times and other papers) but this was before SolrCloud had properly
arrived. We'd only use master-slave rarely now.


*************************************************************************************************************************************************************

If I use SolrCloud, I know I have to set up Zookeeper. I know there is
something called a 'quorum' or 'ensemble' in Zookeeper terminology. I
also know there is a need for (2n+1) Zookeeper nodes per n SolrCloud
nodes. Is your setup running one SolrCloud node per machine (whether
PM or VM)? In your experience, how many nodes, including SolrCloud's
and Zookeeper's, do I need to set up? Is replication in SolrCloud as
easy to set up as in the old version? (I set up replication in
solrconfig.xml and use a solrcore.properties file to set up/switch
roles in a Solr node, rather than defining the role directly in
solrconfig.xml.)

*************************************************************************************************************************************************************

You need at least 3 ZK nodes to form a quorum. How many SolrCloud nodes
you need will depend on how you decide to shard and replicate your data.
There is no single answer to this - it depends on various factors
including query load, query complexity, source data size, indexing
strategy... You should read this page:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
You can run more than one Solr node per machine, but if that machine
dies then your failover setup must be able to cope.

The *only* sensible way to figure out how many nodes you need is to try
out a prototype system. I would guesstimate it will be less than 10
nodes but don't hold me to that! Doing this will also teach you a lot
about ZK and SolrCloud - you're not going to be able to avoid some
learning here. Don't avoid looking at SolrCloud just because it involves
ZK, the advantages outweigh the learning curve IMO.
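
On the quorum arithmetic asked about above, a minimal sketch may help.
ZooKeeper stays available only while a strict majority of the ensemble
is reachable, so the "2n+1" rule of thumb describes how many ZK servers
are needed to survive n ZK failures; it is not tied to the number of
Solr nodes. The function names below are made up for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of ZooKeeper servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


# Print ensemble size, majority size, and failures survived.
for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

An ensemble of 4 tolerates no more failures than an ensemble of 3,
which is why odd sizes (3 or 5) are the usual choice.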

3> If I wish to create another Solr engine with one or two
particular paper types. Is it possible to copy their index data
directly from the big central Solr engine? Or I have to rebuild
index from raw articles data? (Our business has this possibility
of needs.)


Yes, I guess so, but why copy it when you could just search it with
a filter for the paper types?
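
As a rough illustration of the filtering idea, the sketch below builds
the query string for a /select request that restricts results with an
fq (filter query) parameter. The field name paper_type and the function
name are assumptions for the example, not taken from the actual schema.

```python
from urllib.parse import parse_qs, urlencode


def build_select_query(text, paper_types, field="paper_type"):
    """Return a URL-encoded query string limited to the given paper types."""
    # Quote each value so spaces or special characters stay inside one term.
    quoted = " OR ".join(
        '"{}"'.format(t.replace("\\", "\\\\").replace('"', '\\"'))
        for t in paper_types
    )
    params = [
        ("q", text),
        ("fq", "{}:({})".format(field, quoted)),  # filter, does not affect scoring
        ("wt", "json"),
    ]
    return urlencode(params)


query_string = build_select_query("election", ["A", "B"])
print(parse_qs(query_string)["fq"])
```

The resulting string would be appended to something like
http://localhost:8983/solr/<core>/select? on the central index.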

*************************************************************************************************************************************************************

We have a special business case called 'buyout newspaper search
service'. Customers buy an intranet license to use the search service
for articles of some newspaper types and some range of publish dates,
e.g. paper type 'A' for 2010-2012 and paper type 'B' for 2015. Buyout
means we have to install the whole search service at the customer site,
and the customer can only use it within their enterprise intranet
environment. So I have to build a dedicated Solr server for each such
customer. Your idea of filtering is very much like ElasticSearch's
multitenancy; neither fits our buyout business model. Do you have any
suggestions for building a Solr server under these conditions?
*************************************************************************************************************************************************************

You could use Solr's API to extract the subset of articles for
papers/dates for reindexing into a new Solr core.
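
One possible way to do that extraction is cursor-based paging
(cursorMark), sketched below. This is an illustration under
assumptions, not a tested export tool: the field names paper_type and
publish_date, the core URL, and the function names are all
hypothetical. It can only return fields that are stored in the source
index, and the _version_ field would normally be dropped from each
document before posting it to the new core.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_fetch(base_url, params):
    """Run one /select request and return the parsed JSON response."""
    with urlopen(base_url + "/select?" + urlencode(params)) as resp:
        return json.load(resp)


def iter_subset(fetch, filters, unique_key="id", rows=500):
    """Yield every document matching all `filters`, page by page."""
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = [
            ("q", "*:*"),
            ("sort", unique_key + " asc"),  # cursors need a sort on the uniqueKey
            ("rows", str(rows)),
            ("wt", "json"),
            ("cursorMark", cursor),
        ]
        params += [("fq", f) for f in filters]
        page = fetch(params)
        yield from page["response"]["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor


# Hypothetical usage for "paper type A, 2010-2012":
# filters = ["paper_type:A",
#            "publish_date:[2010-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]"]
# fetch = lambda p: http_fetch("http://localhost:8983/solr/news", p)
# for doc in iter_subset(fetch, filters):
#     doc.pop("_version_", None)
#     ...  # post doc to the customer's core
```

Passing the fetch function in keeps the paging loop separate from the
HTTP call, so the loop can be tried against canned responses first.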


Best

Charlie

I'd like to hear some good suggestions and experiences.


Thanks in advance and best regards.

Scott Chu @ 2016/5/11 11:26 GMT+8


Hope this helps!

Cheers

Charlie



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
