On 3/6/2014 12:17 AM, Chia-Chun Shih wrote:
> I am planning a system for searching TBs of structured data in SolrCloud
> and need suggestions for handling such a huge amount of data
> (e.g., number of shards per collection, number of nodes, etc.).
>
> Here are some specs of the system:
>
> 1. Raw data is 35,000 CSV files per day. Each file is about 5 MB.
> 2. One collection serves one day. 200 days of history data are required.
> 3. Building a one-day index should take less than 10 hours.
> 4. An ordinary query (which may span 1~7 days) should execute within 10 minutes.
> 5. Fewer than 10 concurrent users.
>
> I have built an experimental SolrCloud on 3 VMs, each equipped with 8
> cores and 64GB RAM. Each collection has 3 shards and no replication.
> Here are my findings:
>
> 1. Each collection's actual index size is between 30GB and 90GB,
> depending on the number of stored fields.
> 2. It takes 6 to 12 hours to load the raw data. I use multiple (15~30)
> threads to launch HTTP requests. (http://wiki.apache.org/solr/UpdateCSV)
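For reference, the load you describe is essentially a pile of parallel
HTTP posts to the CSV update handler. A minimal sketch of that (the
host, collection name, input directory, and thread count below are
placeholders -- adjust separator and the other CSV parameters to your
data) might look like this:

import glob
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder collection for one day's data.
SOLR_UPDATE_URL = "http://localhost:8983/solr/day_20140306/update"

def post_csv(path):
    # Stream one CSV file to Solr's CSV loader; no per-file commit.
    with open(path, "rb") as f:
        resp = requests.post(SOLR_UPDATE_URL,
                             params={"separator": ",", "commit": "false"},
                             data=f,
                             headers={"Content-Type": "text/csv"})
    resp.raise_for_status()

files = glob.glob("/data/csv/20140306/*.csv")    # placeholder input directory
with ThreadPoolExecutor(max_workers=20) as pool:  # roughly your 15~30 threads
    list(pool.map(post_csv, files))

# One explicit commit at the end instead of committing after every file.
requests.post(SOLR_UPDATE_URL, data="<commit/>",
              headers={"Content-Type": "text/xml"}).raise_for_status()

Whether that finishes inside your 10-hour window is mostly a question of
hardware and how the cluster is sized, which brings me to your real
questions.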
Nobody can give you any specific answers because there are simply too
many variables:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

You do have one unusually loose restriction there -- that the query must
take less than 10 minutes. Most people tend to say that it must take
less than a second, but they'll settle for several seconds. Almost any
reasonable way you could architect your system will probably take less
than ten minutes for a query.

With this much data and potentially a LOT of servers, you might run into
limits that require config changes to address. Things like the thread
limits on the servlet container, connection limits on the shard handler
in Solr, etc.

These blog posts (there are two pages of them) may interest you:

http://www.hathitrust.org/blogs/large-scale-search

One thing that I can tell you is that the more RAM you can get your
hands on, the better it will perform. Ideally you'd have as much free
memory across the whole system as the entire size of your Solr indexes.
The problem with this idea for you is that with 200 collections
averaging 60GB, that's about twelve terabytes of memory across all your
servers -- for one single copy of the index. You'll probably want at
least two copies, so you can survive at least one hardware failure. If
you can't get enough RAM to cache the whole index, putting the index
data on SSD can make a MAJOR difference.

Some strong advice: do everything you can to reduce the size of your
index, which reduces the OS disk cache (RAM) requirements. Don't store
all your fields. Use less aggressive tokenization where possible. Avoid
termVectors and docValues unless they are actually needed. Omit anything
you can -- term frequencies, positions, norms, etc.

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn
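P.S. To make the "reduce the size of your index" advice a little more
concrete, here is a rough, untested sketch of defining a field with most
of the extras turned off. It assumes a recent 4.x release running a
managed, mutable schema so the Schema REST API can add fields; with a
hand-edited schema.xml the same attributes just go on the <field>
definition. The host, collection, field name, and field type are all
placeholders.

import json

import requests

# Placeholder collection name -- one collection per day in your design.
SCHEMA_FIELDS_URL = "http://localhost:8983/solr/day_20140306/schema/fields"

# A lean field: indexed for search, but not stored, no docValues, no
# termVectors, and norms/term frequencies/positions omitted. Note that
# omitting positions disables phrase queries on this field, so only do
# that where it's acceptable.
lean_field = {
    "name": "description",   # placeholder field name
    "type": "text_general",  # placeholder type; prefer lighter analysis where you can
    "indexed": True,
    "stored": False,
    "docValues": False,
    "termVectors": False,
    "omitNorms": True,
    "omitTermFreqAndPositions": True,
}

resp = requests.post(SCHEMA_FIELDS_URL,
                     data=json.dumps([lean_field]),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()

Every attribute you turn off here is index data that no longer has to
compete for space in the OS disk cache.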