On Thu, 2014-03-06 at 08:17 +0100, Chia-Chun Shih wrote:
>    1. Raw data is 35,000 CSV files per day. Each file is about 5 MB.
>    2. One collection serves one day. 200-day history data is required.

So once your data are indexed, they will not change? It seems to me that
1 shard/day is a fine choice. Consider optimizing down to a single
segment when a day's data has been indexed.
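
A minimal sketch of that per-day optimize, done over plain HTTP with
Python/requests (the base URL and the per-day collection name below are
made up; optimize, maxSegments and waitSearcher are standard parameters
on the /update handler):

  import requests

  SOLR = "http://localhost:8983/solr"   # assumed Solr base URL
  collection = "day_20140306"           # invented one-collection-per-day name

  # Merge the finished day down to a single segment.
  response = requests.get(
      "%s/%s/update" % (SOLR, collection),
      params={"optimize": "true", "maxSegments": "1", "waitSearcher": "true"},
      timeout=3600,  # merging a 30-90GB index can take a while
  )
  response.raise_for_status()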

It sounds like your indexing needs CPU power, while your searches are
likely to be I/O bound. You might consider a dedicated indexing machine,
if it is acceptable that data only go live when a day's indexing has
been finished (and copied). 
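
One way to do the "go live" switch, though certainly not the only one, is
to build each day in its own collection and only add it to an alias that
the searchers query once it is finished. A rough sketch using the
Collections API's CREATEALIAS action; the alias and collection names
below are invented:

  import requests

  SOLR = "http://localhost:8983/solr"            # assumed SolrCloud node
  finished_day = "day_20140306"                  # invented name of the freshly built collection
  live_days = ["day_20140304", "day_20140305"]   # collections already visible to searchers

  # CREATEALIAS overwrites the alias, so pass the complete list including the new day.
  live_days.append(finished_day)
  response = requests.get(
      "%s/admin/collections" % SOLR,
      params={
          "action": "CREATEALIAS",
          "name": "live",                        # invented alias that the searchers query
          "collections": ",".join(live_days),
      },
  )
  response.raise_for_status()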

>    3. Take less than 10 hours to build one-day index.
>    4. Allow to execute an ordinary query (may span 1~7 days) in 10 minutes

Since you want to have 200 days and each day takes about 60GB (guessing
from your test), we're looking at 12TB of index at any time.

At the State and University Library, Denmark, we are building an index
for our web archive. We estimate about 20TB of static index to begin
with. We have done some tests with up to 16 * 200GB SolrCloud indexes (details
at http://sbdevel.wordpress.com/2013/12/06/danish-webscale/ ) and our
median response time was about 1200ms for simple queries with light faceting,
using a traditional spinning-drive backend. That put our estimated
median search time for the full corpus at 10 seconds, which was too slow
for us.

With a response time requirement of 10 minutes, which seems extremely
generous in these sub-second times, I am optimistic that "Just make
daily blocks and put them on traditional storage" will work for you.
Subject to your specific data and queries, of course.
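
As an illustration of how a 1-7 day query could be expressed against such
daily blocks: in SolrCloud the collection request parameter lets one
search fan out over several collections. A sketch with invented
collection names and a placeholder query:

  import requests

  SOLR = "http://localhost:8983/solr"                        # assumed SolrCloud node
  days = ["day_20140301", "day_20140302", "day_20140303"]    # invented 3-day span

  # Send the request to any of the collections and list every daily
  # collection it should fan out to in the 'collection' parameter.
  response = requests.get(
      "%s/%s/select" % (SOLR, days[0]),
      params={
          "q": "field:value",            # placeholder query
          "collection": ",".join(days),
          "rows": 10,
          "wt": "json",
      },
  )
  response.raise_for_status()
  print(response.json()["response"]["numFound"])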

If you want a whole other level of performance, use SSDs as your
backend. That goes especially for your large-index scenario, where it is
very expensive to try to compensate for slow spinning drives with RAM. We
designed our search machine around commodity SSDs (Samsung 840) and it
was, relative to data size and performance, dirt cheap.

>    5. concurrent user < 10

Our measurements showed that for this amount of data on spinning drives,
throughput was nearly independent of the number of threads: 4 concurrent
requests took about 4 times as long as a single request. YMMV.
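
If you want to check that on your own hardware, something along these
lines will show how latency behaves as concurrency goes up. The URL and
queries below are placeholders:

  import time
  import requests
  from concurrent.futures import ThreadPoolExecutor

  SEARCH_URL = "http://localhost:8983/solr/day_20140306/select"    # placeholder endpoint
  QUERIES = ["field:foo", "field:bar", "field:baz", "field:qux"]   # placeholder queries

  def timed_query(q):
      # Run a single search and return its wall-clock time in seconds.
      start = time.time()
      requests.get(SEARCH_URL, params={"q": q, "rows": 10, "wt": "json"}).raise_for_status()
      return time.time() - start

  for threads in (1, 2, 4):
      with ThreadPoolExecutor(max_workers=threads) as pool:
          latencies = list(pool.map(timed_query, QUERIES))
      print("%d concurrent: avg %.0f ms/query" % (threads, 1000 * sum(latencies) / len(latencies)))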

Your corpus does present an extra challenge, as it sounds like most of
the indexes will be dormant most of the time. As the disk cache favours
often-accessed data, I'm guessing that you will see some very ugly
response times when you process one of the rarer queries.

> I have built an experimental SolrCloud based on 3 VMs, each equipped with 8
> cores, 64GB RAM.  Each collection has 3 shards and no replication. Here are
> my findings:
> 
>    1. Each collection's actual index size is between 30GB to 90GB,
>    depending on the number of stored field.

I'm guessing that 30-90GB is a day's worth of data? How many documents
does a shard contain?

>    2. It takes 6 to 12 hours to load raw data. I use multiple (15~30)
>    threads to launch http requests. (http://wiki.apache.org/solr/UpdateCSV)

I'm guessing that profiling, tweaking and fiddling will shave the top 2
hours from those numbers.
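
For reference, the kind of multi-threaded CSV posting you describe could
look roughly like the sketch below, using the UpdateCSV handler from the
wiki page. Paths, URL and thread count are illustrative only:

  import glob
  import requests
  from concurrent.futures import ThreadPoolExecutor

  UPDATE_CSV = "http://localhost:8983/solr/day_20140306/update/csv"   # assumed UpdateCSV handler URL
  THREADS = 20                                                        # within the 15-30 range you mention

  def post_csv(path):
      # Stream one ~5MB CSV file to the UpdateCSV handler.
      with open(path, "rb") as handle:
          response = requests.post(
              UPDATE_CSV,
              data=handle,
              headers={"Content-Type": "text/csv; charset=utf-8"},
          )
      response.raise_for_status()

  csv_files = glob.glob("/data/2014-03-06/*.csv")   # invented location of the day's 35,000 files
  with ThreadPoolExecutor(max_workers=THREADS) as pool:
      list(pool.map(post_csv, csv_files))

  # Commit once at the end rather than per file.
  requests.get(UPDATE_CSV.rsplit("/csv", 1)[0], params={"commit": "true"}).raise_for_status()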

Regards,
Toke Eskildsen, State and University Library, Denmark

