On 3/6/2014 12:17 AM, Chia-Chun Shih wrote:
> I am planning a system for searching TBs of structured data in SolrCloud
> and need suggestions for handling such a huge amount of data
> (e.g., number of shards per collection, number of nodes, etc.).
>
> Here are some specs of the system:
>
> 1. Raw data is 35,000 CSV files per day. Each file is about 5 MB.
> 2. One collection serves one day. 200 days of history data are required.
> 3. Building a one-day index should take less than 10 hours.
> 4. An ordinary query (which may span 1~7 days) should execute within 10 minutes.
> 5. Fewer than 10 concurrent users.
>
> I have built an experimental SolrCloud on 3 VMs, each equipped with 8
> cores and 64GB RAM. Each collection has 3 shards and no replication.
> Here are my findings:
>
> 1. Each collection's actual index size is between 30GB and 90GB,
> depending on the number of stored fields.
> 2. It takes 6 to 12 hours to load the raw data. I use multiple (15~30)
> threads to launch HTTP requests. (http://wiki.apache.org/solr/UpdateCSV)
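For reference, the load you describe is essentially a pile of parallel
HTTP posts to the CSV update handler. A minimal sketch of that (the
host, collection name, input directory, and thread count below are
placeholders -- adjust separator and the other CSV parameters to your
data) might look like this:

import glob
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder collection for one day's data.
SOLR_UPDATE_URL = "http://localhost:8983/solr/day_20140306/update"

def post_csv(path):
    # Stream one CSV file to Solr's CSV loader; no per-file commit.
    with open(path, "rb") as f:
        resp = requests.post(SOLR_UPDATE_URL,
                             params={"separator": ",", "commit": "false"},
                             data=f,
                             headers={"Content-Type": "text/csv"})
    resp.raise_for_status()

files = glob.glob("/data/csv/20140306/*.csv")    # placeholder input directory
with ThreadPoolExecutor(max_workers=20) as pool:  # roughly your 15~30 threads
    list(pool.map(post_csv, files))

# One explicit commit at the end instead of committing after every file.
requests.post(SOLR_UPDATE_URL, data="<commit/>",
              headers={"Content-Type": "text/xml"}).raise_for_status()

Whether that finishes inside your 10-hour window is mostly a question of
hardware and how the cluster is sized, which brings me to your real
questions.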
Nobody can give you any specific answers because there are simply too
many variables:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

You do have one unusually loose restriction there -- that the query must
take less than 10 minutes. Most people tend to say that it must take
less than a second, but they'll settle for several seconds. Almost any
reasonable way you could architect your system will probably take less
than ten minutes for a query.

With this much data and potentially a LOT of servers, you might run into
limits that require config changes to address. Things like the thread
limits on the servlet container, connection limits on the shard handler
in Solr, etc.

These blog posts (there are two pages of them) may interest you:

http://www.hathitrust.org/blogs/large-scale-search

One thing that I can tell you is that the more RAM you can get your
hands on, the better it will perform. Ideally you'd have as much free
memory across the whole system as the entire size of your Solr indexes.
The problem with this idea for you is that with 200 collections
averaging 60GB, that's about twelve terabytes of memory across all your
servers -- for one single copy of the index. You'll probably want at
least two copies, so you can survive at least one hardware failure. If
you can't get enough RAM to cache the whole index, putting the index
data on SSD can make a MAJOR difference.

Some strong advice: do everything you can to reduce the size of your
index, which reduces the OS disk cache (RAM) requirements. Don't store
all your fields. Use less aggressive tokenization where possible. Avoid
termVectors and docValues unless they are actually needed. Omit anything
you can -- term frequencies, positions, norms, etc.

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn
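P.S. To make the "reduce the size of your index" advice a little more
concrete, here is a rough, untested sketch of defining a field with most
of the extras turned off. It assumes a recent 4.x release running a
managed, mutable schema so the Schema REST API can add fields; with a
hand-edited schema.xml the same attributes just go on the <field>
definition. The host, collection, field name, and field type are all
placeholders.

import json

import requests

# Placeholder collection name -- one collection per day in your design.
SCHEMA_FIELDS_URL = "http://localhost:8983/solr/day_20140306/schema/fields"

# A lean field: indexed for search, but not stored, no docValues, no
# termVectors, and norms/term frequencies/positions omitted. Note that
# omitting positions disables phrase queries on this field, so only do
# that where it's acceptable.
lean_field = {
    "name": "description",   # placeholder field name
    "type": "text_general",  # placeholder type; prefer lighter analysis where you can
    "indexed": True,
    "stored": False,
    "docValues": False,
    "termVectors": False,
    "omitNorms": True,
    "omitTermFreqAndPositions": True,
}

resp = requests.post(SCHEMA_FIELDS_URL,
                     data=json.dumps([lean_field]),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()

Every attribute you turn off here is index data that no longer has to
compete for space in the OS disk cache.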