Patrick Henry [patricktheawesomeg...@gmail.com] wrote:
> I am working with a Solr collection that is several terabytes in size,
> spanning several hundred million documents. Each document is very rich,
> and over the past few years we have consistently quadrupled the size of
> our collection annually. Unfortunately, this sits on a single node with
> only a few hundred megabytes of memory - so our performance is less
> than ideal.
I assume you mean gigabytes of memory. If you have not already done so,
switching to SSDs for storage should buy you some more time.

> [Going for SolrCloud] We are continuously adding documents and never
> change existing ones. Based on that, one individual recommended that I
> implement custom hashing and route the latest documents to the shard
> with the fewest documents, and when that shard fills up, add a new
> shard and index on the new shard, rinse and repeat.

We have quite a similar setup, where we produce a never-changing shard once
every 8 days and add it to our cloud. One could also combine this setup with
a single live shard to keep the full index constantly up to date.

The memory overhead of running an immutable shard is smaller than that of a
mutable one, and an immutable shard is easier to fine-tune. It also allows
you to optimize the index down to a single segment, which requires a bit
less processing power and saves memory when faceting. There is a description
of our setup at http://sbdevel.wordpress.com/net-archive-search/

From an administrative point of view, we like having complete control over
each shard. We keep track of what goes into it, and in case of schema or
analysis chain changes, we can re-build each shard one at a time and deploy
them continuously, instead of having to re-build everything in one go on a
parallel setup. Of course, fundamental changes to the schema would require a
complete re-build before deployment, so we hope to avoid that.

- Toke Eskildsen
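P.S. The "fill the latest shard, then add a new one" routing that was
recommended to you can be sketched as plain bookkeeping, independent of
Solr itself. This is only a toy model under assumptions I am making up
(shard names, a document-count threshold); with SolrCloud's implicit
router you would use the shard name chosen here when indexing each
document.

```python
class ShardRouter:
    """Toy bookkeeping for append-only routing: fill the newest shard
    until it reaches a document-count threshold, then open a new shard.
    The threshold and shard naming are illustrative assumptions."""

    def __init__(self, max_docs_per_shard):
        self.max_docs = max_docs_per_shard
        self.shards = {"shard1": 0}   # shard name -> document count
        self.current = "shard1"

    def route(self, doc_id):
        """Return the name of the shard this document should go to."""
        if self.shards[self.current] >= self.max_docs:
            # Current shard is full: create the next one and switch to it.
            new_name = "shard%d" % (len(self.shards) + 1)
            self.shards[new_name] = 0
            self.current = new_name
        self.shards[self.current] += 1
        return self.current

router = ShardRouter(max_docs_per_shard=2)
assignments = [router.route(i) for i in range(5)]
# Documents fill shard1, then shard2, then shard3.
```

Since documents are never updated, the newest shard is always the one with
the fewest documents, so "route to the least-filled shard" and "route to the
newest shard" coincide in this scheme.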