Hi, I am planning a system for searching TBs of structured data in SolrCloud, and I need suggestions on how to handle such a large amount of data (e.g., number of shards per collection, number of nodes, etc.).
Here are the specs of the system:

1. Raw data arrives as 35,000 CSV files per day; each file is about 5 MB.
2. One collection serves one day, and 200 days of history must be kept.
3. Building one day's index must take less than 10 hours.
4. An ordinary query (which may span 1-7 days) must finish within 10 minutes.
5. Fewer than 10 concurrent users.

I have built an experimental SolrCloud on 3 VMs, each with 8 cores and 64 GB RAM. Each collection has 3 shards and no replication. Here are my findings:

1. Each collection's actual index size is between 30 GB and 90 GB, depending on the number of stored fields.
2. Loading the raw data takes 6 to 12 hours. I use multiple (15-30) threads to issue HTTP requests against the CSV update handler (http://wiki.apache.org/solr/UpdateCSV); a simplified sketch of the loader is in the P.S. below.

Thanks,
Chia-Chun
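
P.S. In case it helps, here is a minimal sketch of the kind of per-day setup and loading I'm describing (Python; the host, collection name, config name, paths, and thread count are placeholders, and commit handling is assumed to be covered by autoCommit):

import glob
from concurrent.futures import ThreadPoolExecutor
import requests

ADMIN_URL = "http://solr-node1:8983/solr/admin/collections"
CSV_URL = "http://solr-node1:8983/solr/day_20240101/update/csv"
CSV_DIR = "/data/csv/20240101"
THREADS = 20  # I vary this between 15 and 30

# One collection per day: 3 shards, no replication.
requests.get(ADMIN_URL, params={
    "action": "CREATE",
    "name": "day_20240101",
    "numShards": 3,
    "replicationFactor": 1,
    "collection.configName": "myconf",  # placeholder config name
}).raise_for_status()

def post_csv(path):
    # Stream one CSV file to the CSV update handler.
    with open(path, "rb") as f:
        r = requests.post(
            CSV_URL,
            params={"separator": ",", "header": "true"},
            data=f,
            headers={"Content-Type": "text/csv; charset=utf-8"},
        )
    r.raise_for_status()
    return path

# Push the ~35,000 files for one day through a fixed-size thread pool.
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    for _ in pool.map(post_csv, glob.glob(CSV_DIR + "/*.csv")):
        pass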