Patrick Henry [patricktheawesomeg...@gmail.com] wrote:

>I am working with a Solr collection that is several terabytes in size over
> several hundred millions of documents.  Each document is very rich, and
> over the past few years we have consistently quadrupled the size our
> collection annually.  Unfortunately, this sits on a single node with only a
> few hundred megabytes of memory - so our performance is less than ideal.

I assume you mean gigabytes of memory. If you have not already done so, 
switching to SSDs for storage should buy you some more time.

> [Going for SolrCloud]  We are in a continuous adding documents and never 
> change
> existing ones.  Based on that, one individual recommended for me to
> implement custom hashing and route the latest documents to the shard with
> the least documents, and when that shard fills up add a new shard and index
> on the new shard, rinse and repeat.

We have quite a similar setup, where we produce a never-changing shard once 
every 8 days and add it to our cloud. One could also combine this setup with a 
single live shard, for keeping the full index constantly up to date. The memory 
overhead of running an immutable shard is smaller than a mutable one and easier 
to fine-tune. It also allows you to optimize the index down to a single 
segment, which requires a bit less processing power and saves memory when 
faceting. There's a description of our setup at 
http://sbdevel.wordpress.com/net-archive-search/

>From an administrative point of view, we like having complete control over 
>each shard. We keep track of what goes in it and in case of schema or analyze 
>chain changes, we can re-build each shard one at a time and deploy them 
>continuously, instead of having to re-build everything in one go on a parallel 
>setup. Of course, fundamental changes to the schema would require a complete 
>re-build before deploy, so we hope to avoid that.

- Toke Eskildsen

Reply via email to