On 11/05/2016 10:55, scott.chu wrote:
I just found that the mailing list does not seem to accept colored fonts
(I received my own message back from the list and the blue text was gone!).
I am using rows of asterisks to highlight my questions and sending this again.
Answers inline below.
C
----- Original Message ----- From: scott (self) To: solr-user Date:
2016/5/11 (Wed) 17:34 Subject: Re: [scottchu] What kind of
configuration to use for this size of news data?
Hi, Charlie,
First, thanks for your concrete answer. I have further questions,
written in blue below.
scott.chu,scott....@udngroup.com 2016/5/11 (Wed) ----- Original
Message ----- From: Charlie Hull To: solr-user@lucene.apache.org CC:
Date: 2016/5/11 (Wed) 16:21 Subject: Re: [scottchu] What kind of
configuration to use for this size of news data?
On 11/05/2016 04:27, scott.chu wrote:
Fixed some typos, added some words, and resending the same question =>
I want to build a Solr engine for over 60 years of news articles. My
requirements are (I use Solr 5.4.1):
Hi Scott,
We've actually done something very similar for our client NLA
Media Access in the UK, who handle licensing of most UK newspaper
content. They have over 45m docs going back to 2006.
1> Currently over 10M documents.
2> Currently over 60GB total data size.
3> The document count and data size will keep growing at a rate of
about 1,000 documents (or 8MB) per day.
4> There are 5-6 different newspaper types in total.
My questions are: 1> Is it workable just to use the master-slave
model? Or should I turn to SolrCloud? (I ask this because our
system management group has never managed a distributed system before
and has no knowledge of Zookeeper, shards, etc. They also
don't know how to back up/restore distributed data.)
Workable yes, advisable no. You should get much better reliability &
performance with SolrCloud once it's set up. Also, if you have
replication set up correctly the need for backup/restore will be
significantly reduced and may be unnecessary.
We used master-slave for News UK's Solr setup (articles from The
Times and other papers) but this was before SolrCloud had properly
arrived. We'd only use master-slave rarely now.
*************************************************************************************************************************************************************
If I use SolrCloud, I know I have to set up Zookeeper. I know there is
something called a 'quorum' or 'ensemble' in Zookeeper terminology. I
have also read that (2n+1) Zookeeper nodes are needed per n SolrCloud
nodes. In your case, are you running one SolrCloud node per machine
(whether PM or VM)? In your experience, how many nodes, including
SolrCloud's and Zookeeper's, do I need to set up? Is replication in
SolrCloud as easy to set up as in the old version? (I set up replication
in solrconfig.xml and use a solrcore.properties file to set/switch roles
on a Solr node, rather than defining the role directly in solrconfig.xml.)
*************************************************************************************************************************************************************
You need at least 3 ZK nodes to form a quorum. How many SolrCloud nodes
you need will depend on how you decide to shard and replicate your data.
There is no single answer to this - it depends on various factors
including query load, query complexity, source data size, indexing
strategy... You should read this page:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
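
As a rough illustration of the quorum arithmetic: a ZK ensemble stays
available only while a majority of its nodes are up, so an ensemble of
2n+1 nodes tolerates n failures, and this count is independent of how
many Solr nodes you run. The function names below are made up for this
sketch; it only does the arithmetic and does not talk to ZK.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest majority of a ZooKeeper ensemble with this many nodes."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for nodes in (1, 2, 3, 4, 5):
        print(
            f"{nodes} ZK node(s): quorum={quorum_size(nodes)}, "
            f"tolerated failures={tolerated_failures(nodes)}"
        )
```

Three is the smallest ensemble that survives the loss of one node, and
an even-sized ensemble tolerates no more failures than the odd size
below it, which is why odd sizes are the usual choice.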
You can run more than one Solr node per machine, but if that machine
dies then your failover setup must be able to cope.
The *only* sensible way to figure out how many nodes you need is to try
out a prototype system. I would guesstimate it will be less than 10
nodes but don't hold me to that! Doing this will also teach you a lot
about ZK and SolrCloud - you're not going to be able to avoid some
learning here. Don't avoid looking at SolrCloud just because it involves
ZK, the advantages outweigh the learning curve IMO.
3> If I wish to create another Solr engine for one or two
particular paper types, is it possible to copy their index data
directly from the big central Solr engine? Or do I have to rebuild
the index from the raw article data? (Our business may need
this.)
Yes, I guess so, but why copy it when you could just search it with
a filter for the paper types?
*************************************************************************************************************************************************************
We have a special business case called 'buyout newspaper search service'.
Customers buy an intranet license to use the search service for articles
of certain newspaper types and certain ranges of publication dates, e.g.
paper type 'A' for 2010-2012 and paper type 'B' for 2015. The buyout means
we have to install the whole search service at the customer's site, and
the customer can only use it within their enterprise intranet
environment. So I have to build a special Solr server for
each such customer. Your idea of filtering is very much like
Elasticsearch's multitenancy; neither fits our buyout
business model. Do you have any suggestions for building a Solr server
under these conditions?
*************************************************************************************************************************************************************
You could use Solr's API to extract the subset of articles for
papers/dates for reindexing into a new Solr core.
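
A minimal sketch of how that extraction might look, assuming a schema
with hypothetical fields `paper_type` and `publish_date`, a uniqueKey
field named `id`, and a standard `/select` handler. It pages through the
matching articles with Solr's `cursorMark` parameter; the URL and field
names are assumptions, not anything from your setup.

```python
import json
import urllib.parse
import urllib.request


def build_export_params(paper_type, start_year, end_year, cursor="*", rows=500):
    """Build /select parameters for one page of a filtered export.

    `paper_type` and `publish_date` are hypothetical field names;
    `id` is assumed to be the schema's uniqueKey (cursorMark needs it in the sort).
    """
    date_range = f"[{start_year}-01-01T00:00:00Z TO {end_year}-12-31T23:59:59Z]"
    return {
        "q": "*:*",
        "fq": [f'paper_type:"{paper_type}"', f"publish_date:{date_range}"],
        "sort": "id asc",
        "rows": str(rows),
        "cursorMark": cursor,
        "wt": "json",
    }


def fetch_subset(select_url, paper_type, start_year, end_year):
    """Yield every matching document, following nextCursorMark until it stops changing."""
    cursor = "*"
    while True:
        params = build_export_params(paper_type, start_year, end_year, cursor)
        url = select_url + "?" + urllib.parse.urlencode(params, doseq=True)
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        yield from data["response"]["docs"]
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # no more results
            break
        cursor = next_cursor
```

For example, `fetch_subset("http://localhost:8983/solr/news/select", "A", 2010, 2012)`
(a hypothetical core URL) would stream paper type 'A' for 2010-2012,
ready to be posted to the customer's core. Note that a query only
returns stored fields, so this route works for reindexing only if every
field you need is stored in the central index.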
Best
Charlie
I'd like to hear some good suggestions and experiences that I can use.
Thanks in advance and best regards.
Scott Chu @ 2016/5/11 11:26 GMT+8
Hope this helps!
Cheers
Charlie
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk