Hi Toke and Charlie,
Thanks for sharing your cases and your helpful suggestions. After reading
through your mails, I'll delve into SolrCloud. One thing I'd like to share
with everyone on the mailing list: a Chinese corpus can produce a
dramatically large index depending on which tokenization method is used.
The CJK tokenizer is exactly this case, and synonyms become very hard to
establish: for every Chinese word composed of n characters, I would have
to set up C(n,2) synonym entries, which is impossible for me to do.
Unfortunately, my old Solr 3.5 server uses it.
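In case it helps others on the list, the bigram behaviour I mean comes
from a fieldType roughly like this (a minimal sketch; the fieldType name
"text_cjk" is my own example):

    <fieldType name="text_cjk" class="solr.TextField">
      <analyzer>
        <!-- Emits overlapping bigrams, e.g. 中華民國 -> 中華 / 華民 / 民國 -->
        <tokenizer class="solr.CJKTokenizerFactory"/>
      </analyzer>
    </fieldType>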
This time I have two choices for Chinese tokenization:
1> Algorithmic way: use the standard tokenizer and query each Chinese
word as a phrase (by quoting it with double quotes). This keeps the
index at a reasonable size and achieves an effect similar to Google's
search. It also makes synonyms practically feasible (see the first
sketch after this list).
2> Dictionary-oriented way: install mmseg4j (the theory is from Taiwan
but the implementation is from China). The problem is how to maintain
an up-to-date dictionary, especially for news: news keeps coining
brand-new nouns as time goes by (see the second sketch after this
list).
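For choice 1, here is roughly what I mean (a minimal sketch; the
fieldType name, field name and synonyms.txt entry are my own examples).
Queries are then issued as phrases, e.g. q=content:"中華民國":

    <fieldType name="text_zh_std" class="solr.TextField">
      <analyzer type="index">
        <!-- StandardTokenizer emits each Chinese character as one token -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- tokenizerFactory breaks each synonyms.txt entry into the same
             single-character tokens, so a word-level line such as
             臺灣, 台灣
             can match the unigram token stream -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                tokenizerFactory="solr.StandardTokenizerFactory"
                ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>

For choice 2, a minimal sketch of the mmseg4j setup (mode and dicPath
are real mmseg4j attributes; the fieldType name and directory are my
examples). As I understand it, mmseg4j loads custom words*.dic files
from dicPath, so maintaining the dictionary means appending new nouns
to those files and reloading:

    <fieldType name="text_zh_mmseg" class="solr.TextField">
      <analyzer>
        <!-- "complex" is one of mmseg4j's segmentation modes
             (simple / complex / max-word) -->
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
                   mode="complex" dicPath="mmseg4j-dic"/>
      </analyzer>
    </fieldType>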
You both have experience building a newspaper search site. I'd like to
know which kind of tokenizer you use and, if you use the 2nd way, how
you maintain an up-to-date dictionary. I know most Chinese tokenizers
for Solr use the 2nd way, so I'm curious how they keep their
dictionaries current. If anyone here has experience running a
Chinese-corpus Solr server, I'd appreciate it if you're willing to
share your case.
Thanks again and best regards. You guys saved me a lot of time on my job ^_^
scott.chu,[email protected]
2016/5/12 (Thursday)
----- Original Message -----
From: Toke Eskildsen
To: solr-user ; scott (self)
CC:
Date: 2016/5/11 (Wednesday) 18:55
Subject: Re: [scottchu] What kind of configuration to use for this size of
news data?
On Wed, 2016-05-11 at 11:27 +0800, scott.chu wrote:
> I want to build a Solr engine for over 60-year news articles. My
> requests are (I use Solr 5.4.1):
Charlie Hull has given you a fine answer, which I agree with fully, so
I'll just add a bit from our experience.
We are running a similar service for Danish newspapers. We have 16M
OCR'ed pages, split into 250M+ articles, for 1.4TB total index size.
Everything in a single shard on a 64GB machine with SSDs.
We do faceting, range faceting and grouping as part of basic search.
That works okay (sub-second response times) for the bulk of our
requests, but when the hitCount gets above 10M, performance gets poor.
For the real heavy hitters, basically matching everything, we encounter
20-second response times.
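A request of that shape looks roughly like this (the parameter names
are standard Solr; the field names are placeholders, not our actual
schema):

    q=<user query>
    &facet=true
    &facet.field=newspaper
    &facet.range=pub_date
    &facet.range.start=1900-01-01T00:00:00Z
    &facet.range.end=2016-01-01T00:00:00Z
    &facet.range.gap=%2B1YEAR        (URL-encoded +1YEAR)
    &group=true
    &group.field=page_id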
This is not acceptable, so we will be switching to SolrCloud and
multiple shards (on the same machine, as our bottleneck is single
CPU-core performance). However, you have a smaller corpus and the growth
rate does not look alarming.
Putting all this together, I would advise you to try putting everything
in a single shard to avoid the overhead of distributed search. If that
performs well enough for single queries, then add replicas with
SolrCloud to get redundancy and scale throughput. Should you need to
shard at a later time, this will be easy with SolrCloud.
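If you start Solr in SolrCloud mode, that first step can be as simple
as this (the collection name and replica count are placeholders):

    bin/solr create -c news -shards 1 -replicationFactor 2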
- Toke Eskildsen, State and University Library, Denmark