On Wed, 2016-05-11 at 11:27 +0800, scott.chu wrote:
> I want to build a Solr engine for over 60-year news articles. My
> requests are (I use Solr 5.4.1):

Charlie Hull has given you a fine answer, which I agree with fully, so I'll just add a bit from our experience.

We are running a similar service for Danish newspapers. We have 16M OCR'ed pages, split into 250M+ articles, for 1.4TB total index size. Everything is in a single shard on a 64GB machine with SSDs.

We do faceting, range faceting and grouping as part of basic search. That works okay (sub-second response times) for the bulk of our requests, but when the hitCount gets above 10M, performance gets poor. For the real heavy hitters, basically matching everything, we encounter 20 second response times. This is not acceptable, so we will be switching to SolrCloud and multiple shards (on the same machine, as our bottleneck is single CPU-core performance).

However, you have a smaller corpus and the growth rate does not look alarming. Putting all this together, I would advise you to try and put everything in a single shard to avoid the overhead of distributed search. If that performs well enough for single queries, then add replicas with SolrCloud to get redundancy and scale throughput. Should you need to shard at a later time, this will be easy with SolrCloud.

- Toke Eskildsen, State and University Library, Denmark
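
As a rough illustration of the path suggested above (single shard first, replicas next, split only if needed), here is a minimal Python sketch that just builds the Solr Collections API URLs for each step and sends nothing. The host, the collection name "news" and the helper function names are assumptions made for illustration, not part of the setup described above. CREATE, ADDREPLICA and SPLITSHARD are standard Collections API actions, but the exact parameters (for instance whether CREATE also needs collection.configName) should be checked against the reference guide for your Solr version.

```python
from urllib.parse import urlencode

# Hypothetical base URL and collection name; adjust to your own setup.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "news"


def collections_url(action, **params):
    """Build a Collections API URL for the given action (nothing is sent)."""
    query = urlencode({"action": action, **params})
    return f"{SOLR_BASE}/admin/collections?{query}"


def create_single_shard(name=COLLECTION):
    # Step 1: everything in one shard, so no distributed-search overhead.
    return collections_url("CREATE", name=name, numShards=1, replicationFactor=1)


def add_replica(name=COLLECTION, shard="shard1"):
    # Step 2: add a replica of the single shard for redundancy and throughput.
    return collections_url("ADDREPLICA", collection=name, shard=shard)


def split_shard(name=COLLECTION, shard="shard1"):
    # Step 3 (only if needed later): split the existing shard in two.
    return collections_url("SPLITSHARD", collection=name, shard=shard)


if __name__ == "__main__":
    for url in (create_single_shard(), add_replica(), split_shard()):
        print(url)
```

Keeping the steps as separate calls mirrors the advice: measure single-query performance on one shard before adding replicas, and treat splitting as a later option rather than a starting point.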