Hi Charlie,

Thanks first for your concrete answer. I have further questions, written in blue below.
scott.chu, scott....@udngroup.com
2016/5/11 (Wed)

----- Original Message -----
From: Charlie Hull
To: solr-user@lucene.apache.org
CC:
Date: 2016/5/11 (Wed) 16:21
Subject: Re: [scottchu] What kind of configuration to use for this size of news data?

On 11/05/2016 04:27, scott.chu wrote:
> Fix some typos, add some words and resend same question =>
>
> I want to build a Solr engine for over 60 years of news articles. My
> requirements are (I use Solr 5.4.1):

Hi Scott,

We've actually done something very similar for our client NLA Media Access in the UK, who handle licensing of most UK newspaper content. They have over 45m docs going back to 2006.

>
> 1> Currently over 10M docs.
> 2> Currently over 60GB total data size.
> 3> The number of docs and the data size will keep growing at a rate
> of 1000 docs (or 8MB) per day.
> 4> There are 5-6 different newspaper types in total.
>
> My questions are:
> 1> Is it workable just to use the master-slave model? Or should I
> turn to SolrCloud? (I ask this because our system management group
> has never managed a distributed system before and they have no
> knowledge of Zookeeper, shards, etc. They also don't know how to
> backup/restore distributed data.)

Workable yes, advisable no. You should get much better reliability & performance with SolrCloud once it's set up. Also, if you have replication set up correctly the need for backup/restore will be significantly reduced and may be unnecessary. We used master-slave for News UK's Solr setup (articles from The Times and other papers) but this was before SolrCloud had properly arrived. We'd only use master-slave rarely now.

[Scott] If I use SolrCloud, I know I have to set up Zookeeper. I know there is something called a 'quorum' or 'ensemble' in Zookeeper terminology. I also know there is a need for (2n+1) Zookeeper nodes per n SolrCloud nodes. In your case, are you running one SolrCloud node per machine (whether PM or VM)?
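A side note on the (2n+1) figure above, offered as a rough sketch rather than a sizing recommendation: in ZooKeeper the 2n+1 rule relates the ensemble size to the number of ZooKeeper server failures it can survive (a strict majority must stay up), not to the number of Solr nodes. The function names below are made up for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of a ZooKeeper ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def ensemble_size_for(failures: int) -> int:
    """Smallest ensemble that survives the given number of failures (2n+1)."""
    if failures < 0:
        raise ValueError("failures must not be negative")
    return 2 * failures + 1


if __name__ == "__main__":
    # An even-sized ensemble tolerates no more failures than the odd size below it.
    for size in (1, 3, 4, 5):
        print(size, quorum_size(size), tolerated_failures(size))
```

Under this arithmetic a 3-server ensemble survives one failure and a 5-server ensemble survives two, however many Solr nodes connect to it.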
[Scott] In your experience, how many nodes, including both SolrCloud and Zookeeper nodes, do I need to set up? Is replication in SolrCloud as easy to set up as in the old version? (I configure replication in solrconfig.xml and use a solrcore.properties file to set/switch roles on a Solr node, rather than defining the role directly in solrconfig.xml.)

> 2> Say I choose SolrCloud anyway. I wish to keep one shard owning
> one specific year of data. Can it be done?

Yes it can, but it may not be a good idea. If a large proportion of your queries hit recent news you may find one shard dealing with more queries than the others and becoming overloaded. Here's a blog post we wrote a long time ago about this - ignore the name Xapian, this applies to Solr as well:
http://www.flax.co.uk/blog/2009/04/25/distributed-search-and-partition-functions/

> What configuration should
> I do? (AFAIK, SolrCloud distributes data based on some intrinsic
> routing algorithm.)

You can choose how to route data at indexing time:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud

[Scott] You are right. I neglected this condition. I'll think twice and may drop my idea. Thanks for sharing the blog article. I'll take a good look at it.

> 3> If I wish to create another Solr engine with
> one or two particular paper types. Is it possible to copy their index
> data directly from the big central Solr engine? Or do I have to rebuild
> the index from raw article data? (Our business may need this.)

Yes, I guess so, but why copy it when you could just search it with a filter for the paper types?

[Scott] We have a special business case called 'buyout newspaper search service'. Customers buy an intranet license to use the search service for articles of certain newspaper types and certain ranges of publication dates, e.g. paper type 'A' for 2010-2012 and paper type 'B' for 2015.
The buyout means we have to install the whole search service at the customer's site, and the customer can only use it within their enterprise intranet. So, you see, I have to build a dedicated Solr server for each such customer. Your filtering idea is very much like ElasticSearch's multitenancy; neither fits our buyout business model. Do you have any suggestions for building a Solr server under these conditions?

>
> I'd like to hear some good suggestions and experiences.
>
> Thanks in advance and best regards.
>
> Scott Chu @ 2016/5/11 11:26 GMT+8
>

Hope this helps!

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
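To make the buyout example concrete (paper type 'A' for 2010-2012, paper type 'B' for 2015), here is a minimal sketch of turning such a license into a Solr filter query. The field names `paper_type` and `publish_date` and the `LicenseTerm` class are assumptions for illustration; the real schema is not shown in this thread. Such a filter could be used either to restrict searches on the central engine or to select the documents to export for a customer-site index; it does not settle whether index files can be copied directly.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class LicenseTerm:
    """One licensed slice: a paper type and an inclusive range of years."""
    paper_type: str
    first_year: int
    last_year: int


def term_to_clause(term: LicenseTerm) -> str:
    """Build one Solr clause; field names are hypothetical."""
    if term.first_year > term.last_year:
        raise ValueError("first_year must not be after last_year")
    # Escape backslashes and double quotes inside the quoted phrase.
    paper = term.paper_type.replace("\\", "\\\\").replace('"', '\\"')
    start = "%d-01-01T00:00:00Z" % term.first_year
    end = "%d-01-01T00:00:00Z" % (term.last_year + 1)
    # "[start TO end}" includes the start and excludes the end.
    return '(paper_type:"%s" AND publish_date:[%s TO %s})' % (paper, start, end)


def license_filter(terms: list[LicenseTerm]) -> str:
    """OR together the clauses for every licensed slice."""
    if not terms:
        # Refuse an empty license rather than silently matching everything.
        raise ValueError("a license needs at least one term")
    return " OR ".join(term_to_clause(t) for t in terms)


def select_query_string(terms: list[LicenseTerm]) -> str:
    """URL-encoded parameters for a /select request using the filter."""
    return urlencode({"q": "*:*", "fq": license_filter(terms)})


if __name__ == "__main__":
    customer = [LicenseTerm("A", 2010, 2012), LicenseTerm("B", 2015, 2015)]
    print(license_filter(customer))
    print(select_query_string(customer))
```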