Have you considered collection aliasing? You can create an alias that
points to multiple collections. So you could keep specific collections
and have aliases that encompass your regions....
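
As a rough sketch of what that could look like, the snippet below builds the
Collections API CREATEALIAS request for one region. The host, the alias name
"asia", and the collection names "geo_asia" and "collection_b" are made-up
placeholders, not names from your cluster; it assumes each region keeps only
its own source in a small collection and that Collection B stays as it is.

```python
from urllib.parse import urlencode


def create_alias_url(base_url, alias, collections):
    """Build a Collections API CREATEALIAS URL for an alias over several collections."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        # CREATEALIAS takes a comma-separated list of collection names.
        "collections": ",".join(collections),
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical names: one small per-region collection plus the shared one.
url = create_alias_url(
    "http://localhost:8983/solr",
    "asia",
    ["geo_asia", "collection_b"],
)
print(url)
```

Once the alias exists, queries can be sent to it as if it were a collection
name, so Source_C, Source_D and Source_E are indexed only once.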

The one caveat here is that sorting the final result set by score will
require that the collections be roughly similar in terms of TF/IDF.
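
To illustrate why the statistics matter, here is a toy calculation using the
textbook idf = ln(N / df). This is a simplification, not the exact formula
Solr's similarity classes use, and the document frequency of 5,000 is an
invented number. The collection sizes are the rough ones from your mail.

```python
import math


def idf(num_docs, doc_freq):
    """Textbook inverse document frequency (simplified, not Solr's exact formula)."""
    return math.log(num_docs / doc_freq)


# Same term, same (invented) document frequency, very different collection sizes.
idf_small = idf(500_000, 5_000)      # a small geo-specific collection
idf_large = idf(10_000_000, 5_000)   # the large shared collection

print(f"small collection idf: {idf_small:.2f}")
print(f"large collection idf: {idf_large:.2f}")
```

Each collection scores with its own statistics, so the same term can carry a
noticeably different weight depending on which collection a document came
from, and that skews the merged ordering.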

Best,
Erick

On Mon, Nov 13, 2017 at 11:33 AM, Shamik Bandopadhyay <sham...@gmail.com> wrote:
> Hi,
>
>     I'm looking for some input on design considerations for defining
> collections in a SolrCloud cluster. Right now, our cluster consists of two
> collections in a 2 shard / 2 replica mode. Each collection has a dedicated
> set of sources that don't overlap, which made it an easy decision.
> Recently, we got a requirement to index a bunch of new sources that are
> region based. The search results for each region need to come
> from its specific source as well as the sources from one of our existing
> collections. Here's an example of our existing collections and their
> corresponding source(s).
>
> Existing Collection:
> --------------------------
> Collection A --> Source_A, Source_B
> Collection B --> Source_C, Source_D, Source_E
>
> Proposed Collection:
> ----------------------------
> Collection_Asia --> Source_Asia, Source_C, Source_D, Source_E
> Collection_Europe --> Source_Europe, Source_C, Source_D, Source_E
> Collection_Australia --> Source_Australia, Source_C, Source_D, Source_E
>
> The proposed collection part shows that each geo has its dedicated source
> as well as source(s) from existing collection B.
>
> Just wondering if creating a dedicated collection for each geo is the right
> approach here. The main motivation is to support a geo-specific relevancy
> model which can easily be customized without the geos stepping on each
> other. On the downside, I'm not sure it's a good idea to replicate data
> from the same source across various collections. Moreover, the data within
> the sources is not relational, so joining across collections might not be
> an easy proposition.
> The other consideration is the hardware design. Right now, each shard and
> each replica runs on its own dedicated instance. With two collections, we
> sometimes run into OOM scenarios, so I'm a little worried about adding
> more collections. Does the best practice (I know it's subjective) in
> scenarios like this call for a dedicated Solr cluster per collection? From
> an index size perspective, Source_C, Source_D and Source_E combined hold
> close to 10 million documents with a 60 GB volume size. Each geo-based
> source is small and won't exceed 500k documents.
>
> Any pointers will be appreciated.
>
> Thanks,
> Shamik
