Hi Toke and Charlie,
Thanks for sharing your cases and your helpful suggestions. After reading
through your mails, I'll delve into SolrCloud. One thing I'd like to share
with everyone on the mailing list: a Chinese corpus can produce a
dramatically large index depending on which tokenization method is used.
The CJK tokenizer is exactly this case, and synonyms become very hard to
establish: for every Chinese word composed of n characters, I would have
to set up C(n,2) synonym entries, which is impossible for me to do.
Unfortunately, my old Solr 3.5 server uses it.
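In case it helps others on the list, the bigram behaviour I mean comes
from a fieldType roughly like this (a minimal sketch; the fieldType name
"text_cjk" is my own example):

    <fieldType name="text_cjk" class="solr.TextField">
      <analyzer>
        <!-- Emits overlapping bigrams, e.g. 中華民國 -> 中華 / 華民 / 民國 -->
        <tokenizer class="solr.CJKTokenizerFactory"/>
      </analyzer>
    </fieldType>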
This time I have two choices for Chinese tokenization:
1> Algorithmic way: use the standard tokenizer and query each Chinese
word as a phrase (by quoting it with double quotes). This keeps the
index at a reasonable size and achieves an effect similar to Google's
search. It also makes synonyms practically feasible (see the first
sketch after this list).
2> Dictionary-oriented way: install mmseg4j (the theory is from Taiwan
but the implementation is from China). The problem is how to maintain
an up-to-date dictionary, especially for news: news keeps coining
brand-new nouns as time goes by (see the second sketch after this
list).
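For choice 1, here is roughly what I mean (a minimal sketch; the
fieldType name, field name and synonyms.txt entry are my own examples).
Queries are then issued as phrases, e.g. q=content:"中華民國":

    <fieldType name="text_zh_std" class="solr.TextField">
      <analyzer type="index">
        <!-- StandardTokenizer emits each Chinese character as one token -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- tokenizerFactory breaks each synonyms.txt entry into the same
             single-character tokens, so a word-level line such as
             臺灣, 台灣
             can match the unigram token stream -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                tokenizerFactory="solr.StandardTokenizerFactory"
                ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>

For choice 2, a minimal sketch of the mmseg4j setup (mode and dicPath
are real mmseg4j attributes; the fieldType name and directory are my
examples). As I understand it, mmseg4j loads custom words*.dic files
from dicPath, so maintaining the dictionary means appending new nouns
to those files and reloading:

    <fieldType name="text_zh_mmseg" class="solr.TextField">
      <analyzer>
        <!-- "complex" is one of mmseg4j's segmentation modes
             (simple / complex / max-word) -->
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
                   mode="complex" dicPath="mmseg4j-dic"/>
      </analyzer>
    </fieldType>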
You both have experience building a newspaper search site. I'd like to
know which kind of tokenizer you use and, if you use the 2nd way, how
you maintain an up-to-date dictionary. I know most Chinese tokenizers
for Solr use the 2nd way, so I'm curious how they keep their
dictionaries current. If anyone here has experience running a
Chinese-corpus Solr server, I'd appreciate it if you're willing to
share your case.
Thanks again and best regards. You guys saved me a lot of time on my job ^_^
scott.chu,[email protected]
2016/5/12 (Thursday)
----- Original Message -----
From: Toke Eskildsen
To: solr-user ; scott (self)
CC:
Date: 2016/5/11 (Wednesday) 18:55
Subject: Re: [scottchu] What kind of configuration to use for this size of
news data?
On Wed, 2016-05-11 at 11:27 +0800, scott.chu wrote:
> I want to build a Solr engine for over 60-year news articles. My
> requests are (I use Solr 5.4.1):
Charlie Hull has given you a fine answer, which I agree with fully, so
I'll just add a bit from our experience.
We are running a similar service for Danish newspapers. We have 16M
OCR'ed pages, split into 250M+ articles, for 1.4TB total index size.
Everything in a single shard on a 64GB machine with SSDs.
We do faceting, range faceting and grouping as part of basic search.
That works okay (sub-second response times) for the bulk of our
requests, but when the hitCount gets above 10M, performance gets poor.
For the real heavy hitters, basically matching everything, we encounter
20-second response times.
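A request of that shape looks roughly like this (the parameter names
are standard Solr; the field names are placeholders, not our actual
schema):

    q=<user query>
    &facet=true
    &facet.field=newspaper
    &facet.range=pub_date
    &facet.range.start=1900-01-01T00:00:00Z
    &facet.range.end=2016-01-01T00:00:00Z
    &facet.range.gap=%2B1YEAR        (URL-encoded +1YEAR)
    &group=true
    &group.field=page_id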
This is not acceptable, so we will be switching to SolrCloud and
multiple shards (on the same machine, as our bottleneck is single
CPU-core performance). However, you have a smaller corpus and the growth
rate does not look alarming.
Putting all this together, I would advise you to try putting everything
in a single shard to avoid the overhead of distributed search. If that
performs well enough for single queries, then add replicas with
SolrCloud to get redundancy and scale throughput. Should you need to
shard at a later time, this will be easy with SolrCloud.
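If you start Solr in SolrCloud mode, that first step can be as simple
as this (the collection name and replica count are placeholders):

    bin/solr create -c news -shards 1 -replicationFactor 2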
- Toke Eskildsen, State and University Library, Denmark