Hi Toke and Charlie,

Thanks for sharing your cases and your helpful suggestions. After reading through your mails, I'll delve into SolrCloud.

One thing I'd like to share with everyone on the mailing list: a Chinese corpus can produce a dramatically large index depending on which tokenization method is used. The CJK tokenizer falls into this trap, and synonyms become very hard to establish (for every Chinese word composed of n Chinese characters, I would have to define C(n,2) synonym entries, which is impossible for me to do by hand). Unfortunately, my old Solr 3.5 server uses it.
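To make that blow-up concrete for others on the list: the stock bigram approach (solr.CJKTokenizerFactory in 3.x, or StandardTokenizer plus CJKBigramFilter in newer releases) indexes a 4-character word such as 中華民國 as the overlapping bigrams 中華 / 華民 / 民國, so synonyms end up being declared against character pairs instead of whole words. A minimal sketch of the kind of field type I mean, assuming the stock 3.x factory names (untested, simplified):

    <fieldType name="text_cjk" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.CJKTokenizerFactory"/>
        <!-- matching happens per bigram, so every n-character word needs
             on the order of C(n,2) entries in synonyms.txt -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>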
This time I have two choices for Chinese tokenizing (rough sketches of both are in the P.S. at the bottom of this mail):

1> Algorithmic way: use the standard tokenizer and quote Chinese words as phrases at query time (with double quotes). This keeps the index at a reasonable size and gives an effect similar to Google's search. It also makes synonyms practically feasible.

2> Dictionary-oriented way: install mmseg4j (the theory comes from Taiwan, but the implementation comes from China). The problem here is how to maintain an up-to-date dictionary, especially for news: news articles keep coining brand-new nouns as time goes by.

You both have experience building a newspaper search site. I'd like to know which kind of tokenizer you use and, assuming you use the second way, how you keep the dictionary up to date. As far as I know, most Chinese tokenizers for Solr take the second way, so I'm curious how their users maintain an up-to-date dictionary.

If anyone else has experience running a Chinese-corpus Solr server, I'd appreciate it if you're willing to share your case.

Thanks again and best regards, you guys saved me a lot of time ^_^

scott.chu, scott....@udngroup.com
2016/5/12 (Thu)

----- Original Message -----
From: Toke Eskildsen
To: solr-user ; scott (myself)
CC:
Date: 2016/5/11 (Wed) 18:55
Subject: Re: [scottchu] What kind of configuration to use for this size of news data?

On Wed, 2016-05-11 at 11:27 +0800, scott.chu wrote:
> I want to build a Solr engine for over 60-year news articles. My
> requests are (I use Solr 5.4.1):

Charlie Hull has given you a fine answer, which I agree with fully, so I'll just add a bit from our experience.

We are running a similar service for Danish newspapers. We have 16M OCR'ed pages, split into 250M+ articles, for 1.4TB total index size. Everything in a single shard on a 64GB machine with SSDs.

We do faceting, range faceting and grouping as part of basic search. That works okay (sub-second response times) for the bulk of our requests, but when the hit count gets above 10M, performance gets poor. For the real heavy hitters, basically matching everything, we encounter 20-second response times. This is not acceptable, so we will be switching to SolrCloud and multiple shards (on the same machine, as our bottleneck is single CPU-core performance).

However, you have a smaller corpus and the growth rate does not look alarming. Putting all this together, I would advise you to try to put everything in a single shard to avoid the overhead of distributed search. If that performs well enough for single queries, then add replicas with SolrCloud to get redundancy and scale throughput. Should you need to shard at a later time, this will be easy with SolrCloud.

- Toke Eskildsen, State and University Library, Denmark
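P.S. In case it helps others on the list, here is roughly what I mean by the two choices. These are untested sketches; the field names are placeholders of mine, and I'm recalling the mmseg4j factory class from its docs, so please double-check before copying:

    <!-- 1> Algorithmic way: StandardTokenizer emits each Chinese character
         as its own token, so a Chinese word is matched by quoting it as a
         phrase at query time, e.g. q=content:"中華民國". Multi-token
         synonym entries stay at the word level, which is why this way
         keeps synonyms feasible. -->
    <fieldType name="text_zh_std" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>

    <!-- 2> Dictionary-oriented way: mmseg4j. dicPath points at the
         directory holding the dictionary files, and that directory is
         exactly what has to be kept up to date as the news coins new
         nouns. -->
    <fieldType name="text_zh_mmseg" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
                   mode="complex" dicPath="dic"/>
      </analyzer>
    </fieldType>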
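P.P.S. Toke, if I follow your single-shard advice, I guess on 5.4.1 I would start with something like the command below and grow later through the Collections API (ADDREPLICA for redundancy, SPLITSHARD if sharding is ever needed). The collection and configset names are just placeholders; please correct me if this is not what you meant:

    bin/solr create -c news -d basic_configs -shards 1 -replicationFactor 1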