On Wed, 2016-05-11 at 11:27 +0800, scott.chu wrote:
> I want to build a Solr engine for over 60-year news articles. My
> requests are (I use Solr 5.4.1):

Charlie Hull has given you a fine answer, which I agree with fully, so
I'll just add a bit from our experience.

We are running a similar service for Danish newspapers. We have 16M
OCR'ed pages, split into 250M+ articles, for 1.4TB total index size.
Everything in a single shard on a 64GB machine with SSDs.

We do faceting, range faceting and grouping as part of basic search.
That works okay (sub-second response times) for the bulk of our
requests, but when the hitCount gets above 10M, performance gets poor.
For the real heavy hitters, basically matching everything, we encounter
20 second response times.
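
For illustration, a request combining field faceting, range faceting and
grouping might be built roughly like this. It is only a sketch: the
collection name and the field names (newspaper, publish_date, edition_id)
are made up for the example and are not taken from our setup. The
parameter names are the standard Solr ones.

```python
from urllib.parse import urlencode


def build_search_url(base_url, collection, user_query):
    """Build a /select URL that facets, range-facets and groups in one request.

    All field names below are hypothetical placeholders.
    """
    params = [
        ("q", user_query),
        ("rows", 10),
        # Plain field faceting
        ("facet", "true"),
        ("facet.field", "newspaper"),
        # Range faceting over a date field, one bucket per year
        ("facet.range", "publish_date"),
        ("facet.range.start", "NOW/YEAR-60YEARS"),
        ("facet.range.end", "NOW"),
        ("facet.range.gap", "+1YEAR"),
        # Result grouping
        ("group", "true"),
        ("group.field", "edition_id"),
    ]
    return "{}/{}/select?{}".format(base_url, collection, urlencode(params))


if __name__ == "__main__":
    print(build_search_url("http://localhost:8983/solr", "news", "*:*"))
```

A match-all query such as *:* is the kind of request that hits every
document, which is where faceting and grouping get expensive.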

This is not acceptable, so we will be switching to SolrCloud and
multiple shards (on the same machine, as our bottleneck is single
CPU-core performance). However, you have a smaller corpus and the growth
rate does not look alarming.


Putting all this together, I would advise you to try and put everything
in a single shard to avoid the overhead of distributed search. If that
performs well enough for single queries, then add replicas with
SolrCloud to get redundancy and scale throughput. Should you need to
shard at a later time, this will be easy with SolrCloud.
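
As a rough sketch of what that could look like with the SolrCloud
Collections API: create the collection with one shard and two replicas,
and split the shard later if needed. The host, the collection name "news"
and the helper function names are assumptions for the example. CREATE and
SPLITSHARD are existing Collections API actions, but check the parameters
against the reference guide for your Solr version before relying on this.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, replicas=2):
    """URL for creating a single-shard collection with the given replica count."""
    params = [
        ("action", "CREATE"),
        ("name", name),
        ("numShards", 1),                # everything in one shard
        ("replicationFactor", replicas), # copies for redundancy and throughput
    ]
    return "{}/admin/collections?{}".format(base_url, urlencode(params))


def split_shard_url(base_url, collection, shard="shard1"):
    """URL for splitting an existing shard in two at a later time."""
    params = [
        ("action", "SPLITSHARD"),
        ("collection", collection),
        ("shard", shard),
    ]
    return "{}/admin/collections?{}".format(base_url, urlencode(params))


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # assumed host and port
    print(create_collection_url(base, "news"))
    print(split_shard_url(base, "news"))
```

The URLs are only printed here, not requested, so you can inspect them
before pointing them at a real cluster.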

- Toke Eskildsen, State and University Library, Denmark

