Re: Solr Cloud sharding strategy

2016-03-07 Thread Erick Erickson
What do you mean "the rest of the cluster"? The routing is based on the key provided. All of the "enu" prefixes will go to one of your shards. All the "deu" docs will appear on one shard. All the "esp" will be on one shard. All the "chs" docs will be on one shard. Which shard will each go to? Good
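A rough sketch of why every document sharing a prefix ends up on the same shard. It assumes the default compositeId router with a two-part `prefix!id` key, where the upper 16 bits of the routing hash come from the prefix and the lower 16 bits from the rest of the ID, both hashed with 32-bit MurmurHash3. The function names, the sample IDs, and the equal-width shard split in `shard_for` are illustrative assumptions, not Solr's actual code or range assignment.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant), returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    rounded = n - (n % 4)

    def mix(k: int) -> int:
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        return (k * c2) & 0xFFFFFFFF

    # Body: process 4-byte little-endian blocks.
    for i in range(0, rounded, 4):
        h ^= mix(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: the remaining 1-3 bytes, if any.
    tail = data[rounded:]
    if tail:
        h ^= mix(int.from_bytes(tail, "little"))

    # Finalization (avalanche).
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def composite_hash(doc_id: str) -> int:
    """Top 16 bits from the prefix hash, bottom 16 bits from the rest of the ID."""
    prefix, _, rest = doc_id.partition("!")
    upper = murmur3_32(prefix.encode("utf-8")) & 0xFFFF0000
    lower = murmur3_32(rest.encode("utf-8")) & 0x0000FFFF
    return upper | lower


def shard_for(doc_id: str, num_shards: int) -> int:
    """Assumed simplification: split the hash space into equal slices by the top 16 bits."""
    top16 = composite_hash(doc_id) >> 16
    # Flip the sign bit so the slices follow signed 32-bit ordering.
    return ((top16 ^ 0x8000) * num_shards) // 65536


if __name__ == "__main__":
    for doc_id in ("enu!doc1", "enu!doc2", "deu!doc1", "esp!doc1", "chs!doc1"):
        print(doc_id, "-> shard", shard_for(doc_id, 4))
```

Which shard a given prefix lands on depends only on its hash, so with just four prefixes nothing stops two languages from sharing a shard while another shard stays empty.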

Re: Solr Cloud sharding strategy

2016-03-07 Thread shamik
Thanks Erick and Walter, this is extremely insightful. One last follow-up question on composite routing. I'm trying to get a better understanding of index distribution. If I use language as a prefix, SolrCloud guarantees that content in the same language will be routed to the same shard. What I'm curious t

Re: Solr Cloud sharding strategy

2016-03-07 Thread Walter Underwood
Excellent advice, and I’d like to reinforce a few things.
* Solr indexing is CPU-intensive and generates lots of disk IO. Faster CPUs and faster disks matter a lot.
* Realistic user query logs are super important. We measure 95th percentile latency and that is dominated by rare and malformed que
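A minimal sketch of computing a 95th percentile latency from replayed query timings, using the nearest-rank method. The function name and the sample timings are made up for illustration; real numbers would come from running realistic user query logs against the cluster.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical query times in milliseconds, with a couple of slow outliers.
    query_times_ms = [12, 15, 11, 14, 13, 18, 16, 12, 950, 17,
                      13, 14, 15, 12, 11, 19, 16, 14, 13, 2400]
    print("median:", percentile(query_times_ms, 50), "ms")
    print("p95:", percentile(query_times_ms, 95), "ms")
```

A handful of slow queries barely moves the median but largely sets the 95th percentile, which is why a synthetic log without such queries tends to understate it.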

Re: Solr Cloud sharding strategy

2016-03-07 Thread Erick Erickson
Still, 50M is not excessive for a single shard, although it's getting into the range where I'd want proof that my hardware etc. is adequate before committing to it. I've seen up to 300M docs on a single machine; admittedly they were tweets. YMMV based on hardware and index complexity, of course. Here'

Re: Solr Cloud sharding strategy

2016-03-07 Thread shamik
Thanks a lot, Erick. You are right, it's a tad small with around 20 million documents, but the growth projection is around 50 million in the next 6-8 months. It'll continue to grow, but maybe not at the same rate. From the index size point of view, the size can grow up to half a TB from its current state.

Re: Solr Cloud sharding strategy

2016-03-07 Thread Erick Erickson
20M docs is actually a very small collection by the "usual" Solr standards unless they're _really_ large documents, e.g. large books. Actually, I wouldn't even shard to begin with; it's unlikely to be necessary and it adds inevitable overhead. If you _must_ shard, just go with <1>, but again I