Hi,

I am planning a system for searching TBs of structured data in SolrCloud,
and I need suggestions for handling such a huge amount of data
(e.g., number of shards per collection, number of nodes, etc.).

Here are some specs of the system:

   1. Raw data is 35,000 CSV files per day, each about 5 MB.
   2. One collection serves one day, and 200 days of history must be kept.
   3. Building a one-day index must take less than 10 hours.
   4. An ordinary query (which may span 1~7 days) must finish within 10
   minutes (see the query sketch after this list).
   5. Fewer than 10 concurrent users.
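
For scale, that is about 35,000 files x 5 MB, roughly 175 GB of raw CSV per
day, or around 35 TB of raw data over the 200-day window.

For the multi-day queries in item 4, my plan is to send one request that
fans out over the per-day collections. Below is a rough sketch (Python with
the requests library); the node URL, the day_YYYYMMDD collection names, and
the query field are placeholders, not my real schema. The point is just
that SolrCloud's "collection" parameter lets a single /select request span
several collections; I understand a collection alias covering the day range
would work similarly.

import requests

SOLR = "http://solr-node1:8983/solr"   # placeholder node address

def query_days(day_collections, q, rows=100):
    # The request goes to the first collection; the "collection" parameter
    # makes SolrCloud fan it out to all listed collections.
    params = {
        "q": q,
        "rows": rows,
        "wt": "json",
        "collection": ",".join(day_collections),
    }
    resp = requests.get(SOLR + "/" + day_collections[0] + "/select",
                        params=params)
    resp.raise_for_status()
    return resp.json()["response"]

# example: a 3-day span
result = query_days(["day_20140101", "day_20140102", "day_20140103"],
                    q="field1:foo")
print(result["numFound"])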

I have built an experimental SolrCloud based on 3 VMs, each equipped with 8
cores and 64 GB of RAM. Each collection has 3 shards and no replication. Here are
my findings:

   1. Each collection's actual index size is between 30 GB and 90 GB,
   depending on the number of stored fields.
   2. It takes 6 to 12 hours to load one day of raw data. I use multiple
   (15~30) threads to issue HTTP requests to the CSV update handler
   (http://wiki.apache.org/solr/UpdateCSV); a simplified sketch of this
   loader follows the list.
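
In case it helps, here is a simplified sketch of the loader (Python with
requests and a thread pool); the URL, collection name, thread count, and
file paths below are placeholders rather than my real configuration:

import glob
from concurrent.futures import ThreadPoolExecutor

import requests

SOLR_UPDATE = "http://solr-node1:8983/solr/day_20140101/update"  # placeholder
NUM_THREADS = 20                                                 # I use 15~30

def post_csv(path):
    # Stream one CSV file to the CSV update handler.
    with open(path, "rb") as f:
        resp = requests.post(SOLR_UPDATE,
                             params={"commit": "false"},
                             data=f,
                             headers={"Content-Type": "text/csv"})
    resp.raise_for_status()
    return path

files = glob.glob("/data/csv/20140101/*.csv")    # ~35,000 files for one day
with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    for _ in pool.map(post_csv, files):
        pass

# one explicit commit at the end instead of committing per file
requests.post(SOLR_UPDATE, data="<commit/>",
              headers={"Content-Type": "text/xml"}).raise_for_status()

Committing once at the end, rather than per file, avoids stalling the
indexing run on thousands of small commits.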


Thanks,
Chia-Chun
