20M docs is actually a very small collection by "usual" Solr
standards, unless they're _really_ large documents, e.g.
entire books.

Actually, I wouldn't even shard to begin with: it's unlikely to be
necessary, and sharding adds inevitable overhead. If you _must_ shard,
just go with <1>, but again I'd be surprised if it were even
necessary.
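
For what it's worth, here is a rough way to see why <1> tends to balance
better than a language prefix (<2>). This is only a toy simulation, not
Solr's routing code: MD5 stands in for the MurmurHash3 that Solr uses, the
shard boundaries are simplified to the top 16 bits of the hash, and the
language codes, document ids and counts (30% English, the rest spread over
nine other languages) are made up for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3

# Hypothetical corpus: 30% English, the rest split evenly over 9 languages.
LANGUAGES = {"en": 2700, "de": 700, "fr": 700, "es": 700, "it": 700,
             "ja": 700, "ko": 700, "pt": 700, "ru": 700, "zh": 700}


def hash32(text):
    """Stand-in 32-bit hash (Solr itself uses MurmurHash3, not MD5)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def shard_for(hash_value, num_shards=NUM_SHARDS):
    """Pick a shard from the top 16 bits of the hash (a simplification,
    so that a shard boundary never cuts through one routing prefix)."""
    return ((hash_value >> 16) * num_shards) >> 16


def route_plain(doc_id):
    """Option <1>: the whole hash comes from the document id."""
    return shard_for(hash32(doc_id))


def route_by_language(language, doc_id):
    """Option <2>: top 16 bits from the language prefix, low 16 from the id."""
    combined = (hash32(language) & 0xFFFF0000) | (hash32(doc_id) & 0x0000FFFF)
    return shard_for(combined)


def corpus():
    """Yield (language, doc_id) pairs for the hypothetical corpus."""
    for language, count in LANGUAGES.items():
        for n in range(count):
            yield language, f"{language}-doc-{n}"


def shard_counts(router):
    """Count how many documents each shard receives under a router."""
    return Counter(router(language, doc_id) for language, doc_id in corpus())


plain = shard_counts(lambda language, doc_id: route_plain(doc_id))
by_language = shard_counts(route_by_language)

print("plain id routing:   ", sorted(plain.items()))
print("language!id routing:", sorted(by_language.items()))
```

With plain ids the three counts should come out close to a third each. With
the language prefix, every document of a language lands on the same shard,
so the balance depends entirely on where the handful of prefixes happen to
hash, and the shard that gets English carries at least 30% on its own.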

Best,
Erick

On Mon, Mar 7, 2016 at 2:35 PM, Shamik Bandopadhyay <sham...@gmail.com> wrote:
> Hi,
>
>   I'm trying to figure out the best way to design/allocate shards for our Solr
> Cloud environment. Our current index has around 20 million documents, in 10
> languages. Around 25-30% of the content is in English. The rest is almost
> equally distributed among the remaining 13 languages. Until now, we had to
> deal with query-time deduplication using the collapsing parser, for which we
> used multi-level composite routing. But because of that, documents were
> disproportionately distributed across 3 shards. The shard containing the
> duplicate data ended up hosting 80% of the index. For example, Shard1 had a
> 30 GB index while Shard2 and Shard3 had 10 GB each. The composite key is
> currently made of "language!dedup_id!url". At query time, we are using
> shard.keys=language/8! for three-level routing.
>
> Due to the performance overhead, we decided to move the de-duplication logic
> to index time, which made the composite routing redundant. We are not
> discarding the duplicate content, so there's no change in index size. Before
> I update the routing key, I just wanted to check what the best approach to
> the sharding architecture would be so that we get optimal performance.
> We currently have 3 shards with 2 replicas each. The entire index resides
> in a single collection. What I'm trying to understand is whether:
>
> 1. We let Solr use simple document routing based on id and route the
> documents to any of the 3 shards
> 2. We create a composite id using language, e.g. language!unique_id, and
> make sure that content in the same language will always be in the same shard.
> What I'm not sure about is whether the index will be equally distributed
> across the three shards.
> 3. Index English-only content to a dedicated shard, with the rest equally
> distributed across the remaining two. I'm not sure if that's possible.
> 4. Create a dedicated collection for English and one for the rest of the
> languages.
>
> Any pointers on this will be highly appreciated.
>
> Regards,
> Shamik
