20M docs is actually a very small collection by "usual" Solr standards unless they're _really_ large documents, e.g. entire books.
Actually, I wouldn't even shard to begin with; it's unlikely to be
necessary and it adds inevitable overhead. If you _must_ shard, just go
with <1>, but again I would be surprised if it was even necessary.

Best,
Erick

On Mon, Mar 7, 2016 at 2:35 PM, Shamik Bandopadhyay <sham...@gmail.com> wrote:
> Hi,
>
> I'm trying to figure out the best way to design/allocate shards for our
> Solr Cloud environment. Our current index has around 20 million documents,
> in 10 languages. Around 25-30% of the content is in English. Rest are
> almost equally distributed among the remaining 13 languages. Till now, we
> had to deal with query time deduplication using collapsing parser for
> which we used multi-level composite routing. But due to that, documents
> were disproportionately distributed across 3 shards. The shard containing
> the duplicate data ended up hosting 80% of the index. For example, Shard1
> had a 30gb index while Shard2 and Shard3 had 10gb each. The composite key
> is currently made of "language!dedup_id!url". At query time, we are using
> shard.keys=language/8! for three level routing.
>
> Due to performance overhead, we decided to move the de-duplication logic
> to index time, which made the composite routing redundant. We are not
> discarding the duplicate content so there's no change in index size.
> Before I update the routing key, I just wanted to check what would be the
> best approach to the sharding architecture so that we get optimal
> performance. We currently have 3 shards with 2 replicas each. The entire
> index resides in one single collection. What I'm trying to understand is
> whether:
>
> 1. We let Solr use simple document routing based on id and route the
> documents to any of the 3 shards.
> 2. We create a composite id using language, e.g. language!unique_id, and
> make sure that the same language content will always be in the same
> shard. What I'm not sure about is whether the index will be equally
> distributed across the three shards.
> 3. Index English only content to a dedicated shard, rest equally
> distributed to the remaining two. I'm not sure if that's possible.
> 4. Create a dedicated collection for English and one for rest of the
> languages.
>
> Any pointers on this will be highly appreciated.
>
> Regards,
> Shamik
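
A rough way to see why option 1 tends to balance the shards while option 2 may not is to simulate both routing schemes. The Python sketch below is only an illustration under stated assumptions: `hash32` is an MD5-based stand-in and not the hash Solr actually uses, the shard ranges are cut on the top 16 bits as a simplification, the language codes and per-language counts are made up to match the rough mix described above (about 28% English, the rest spread evenly over nine other languages, scaled down to 20,000 documents), and the two-level composite hash is modelled as the upper 16 bits coming from the language prefix and the lower 16 bits from the document id.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3

# Assumed, scaled-down language mix: 1 simulated doc per 1,000 real ones.
OTHER_LANGUAGES = ("de", "fr", "es", "it", "pt", "ja", "ko", "zh", "ru")
DOCS_PER_LANGUAGE = {"en": 5600, **{lang: 1600 for lang in OTHER_LANGUAGES}}


def hash32(text):
    """Stand-in 32-bit hash built from MD5 (NOT the hash Solr uses)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def shard_of(hash_value):
    """Map a 32-bit hash to one of NUM_SHARDS contiguous ranges.

    Only the top 16 bits are used, so range boundaries never cut
    through a single prefix (a simplification for this sketch).
    """
    return ((hash_value >> 16) * NUM_SHARDS) >> 16


def route_plain(doc_id):
    """Option 1: route on the hash of the whole document id."""
    return shard_of(hash32(doc_id))


def route_composite(prefix, doc_id):
    """Option 2: upper 16 bits from the prefix, lower 16 from the id."""
    combined = (hash32(prefix) & 0xFFFF0000) | (hash32(doc_id) & 0x0000FFFF)
    return shard_of(combined)


def simulate():
    """Return per-shard document counts for both routing schemes."""
    plain, composite = Counter(), Counter()
    for lang, count in DOCS_PER_LANGUAGE.items():
        for i in range(count):
            doc_id = f"{lang}-doc-{i}"  # hypothetical id format
            plain[route_plain(doc_id)] += 1
            composite[route_composite(lang, doc_id)] += 1
    return plain, composite


if __name__ == "__main__":
    plain, composite = simulate()
    for shard in range(NUM_SHARDS):
        print(f"shard{shard + 1}: plain={plain[shard]:5d}  "
              f"language!id={composite[shard]:5d}")
```

With plain id routing the three counts should come out close to one another. With the language prefix, every document of a language lands on the same shard, so the balance depends entirely on where the ten prefixes happen to hash, and whichever shard receives English starts with more than a quarter of the index before any other language is added.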