On Thu, 2014-03-06 at 08:17 +0100, Chia-Chun Shih wrote:
> 1. Raw data is 35,000 CSV files per day. Each file is about 5 MB.
> 2. One collection serves one day. 200-day history data is required.
So once your data are indexed, they will not change? It seems to me that
1 shard/day is a fine choice. Consider optimizing down to a single segment
when a day's data has been indexed.

It sounds like your indexing needs CPU power, while your searches are
likely to be I/O bound. You might consider a dedicated indexing machine,
if it is acceptable that data only go live when a day's indexing has been
finished (and copied).

> 3. Take less than 10 hours to build one-day index.
> 4. Allow to execute an ordinary query (may span 1~7 days) in 10 minutes

Since you want to have 200 days and each day takes about 60GB (guessing
from your test), we're looking at 12TB of index at any time.

At the State and University Library, Denmark, we are building an index
for our web archive. We estimate about 20TB of static index to begin
with. We have done some tests of up to 16*200GB clouded indexes (details
at http://sbdevel.wordpress.com/2013/12/06/danish-webscale/ ) and our
median was about 1200ms for simple queries with light faceting, when we
used a backend of traditional spinning drives. That put our estimated
median search time for the full corpus at 10 seconds, which was too slow
for us.

With a response time requirement of 10 minutes, which seems extremely
generous in these sub-second times, I am optimistic that "Just make daily
blocks and put them on traditional storage" will work for you. Subject to
your specific data and queries, of course.

If you want a whole other level of performance, use SSDs as your backend,
especially for your large index scenario, where it is very expensive to
try and compensate for slow spinning drives with RAM. We designed our
search machine around commodity SSDs (Samsung 840) and it was, relative
to data size and performance, dirt cheap.

> 5. concurrent user < 10

Our measurements showed that for this amount of data on spinning drives,
throughput was nearly independent of the number of threads: 4 concurrent
requests took 4 times as long as a single request. YMMV.

Your corpus does present an extra challenge, as it sounds like most of
the indexes will be dormant most of the time. As the disk cache favours
often-accessed data, I'm guessing that you will get some very ugly
response times when you process one of the rarer queries.

> I have built an experimental SolrCloud based on 3 VMs, each equipped with 8
> cores, 64GB RAM. Each collection has 3 shards and no replication. Here are
> my findings:
>
> 1. Each collection's actual index size is between 30GB to 90GB,
> depending on the number of stored field.

I'm guessing that 30-90GB is a day's worth of data? How many documents
does a shard contain?

> 2. It takes 6 to 12 hours to load raw data. I use multiple (15~30)
> threads to launch http requests. (http://wiki.apache.org/solr/UpdateCSV)

I'm guessing that profiling, tweaking and fiddling will shave the top 2
hours from those numbers.

Regards,
Toke Eskildsen, State and University Library, Denmark
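PS: In case it helps, here is a minimal Python sketch of the indexing
flow I had in mind: a pool of threads streaming CSV files to the stock
/update/csv handler, followed by a single commit and an optimize down to
one segment once the day is finished. The host, collection name, input
path and thread count below are placeholders I made up, not anything
from your setup.

  # Sketch only: host, collection name, input path and thread count are
  # assumptions, not taken from this thread.
  import concurrent.futures
  import glob
  import requests

  SOLR = "http://localhost:8983/solr/day_20140306"   # hypothetical per-day collection
  CSV_FILES = glob.glob("/data/raw/20140306/*.csv")  # hypothetical input location
  THREADS = 20                                       # you reported using 15~30 threads

  def upload(path):
      # Stream one CSV file to the CSV update handler; header=true makes
      # Solr read the field names from the first line of the file.
      with open(path, "rb") as f:
          response = requests.post(
              SOLR + "/update/csv",
              params={"header": "true", "separator": ","},
              data=f,
              headers={"Content-Type": "text/csv; charset=utf-8"})
      response.raise_for_status()
      return path

  with concurrent.futures.ThreadPoolExecutor(max_workers=THREADS) as pool:
      for finished in pool.map(upload, CSV_FILES):
          print("indexed", finished)

  # Commit once at the end rather than per file, then merge the day's
  # index down to a single segment since it will never change again.
  requests.get(SOLR + "/update", params={"commit": "true"}).raise_for_status()
  requests.get(SOLR + "/update",
               params={"optimize": "true", "maxSegments": "1"}).raise_for_status()

Committing once at the end should also keep merge pressure down during
the run, and the final optimize leaves the day as a single static
segment, which fits the "data only go live when the day is finished"
setup mentioned above.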