Well, it's not either/or. And you haven't said how many tenants we're talking about here. Startup times for a single _instance_ of Solr can be slow when there are thousands of collections.
But note what I am talking about here: a single Solr instance on a single node with hundreds and hundreds of collections (or replicas, for that matter). I know of very large installations with hundreds of thousands of _replicas_ that run. Admittedly with a lot of care and feeding...

Sharding a single large collection and using custom routing to push each tenant to a single shard will be an administrative problem for you. I'm assuming you have the typical multi-tenant distribution: a bunch of tenants have around N docs, some smaller percentage have 3N, and a few have 100N. Now you're having to keep track of how many docs are on each shard, do the routing yourself, etc. Plus you can't commit individually; a commit on one shard will _still_ commit on all of them, so you're right back where you started. (I've put a rough sketch of what implicit routing looks like at the very bottom, under your quoted mail.)

I've seen people use a hybrid approach: experiment to find how many _documents_ you can have in a collection (however you partition that up) and use the multi-tenant approach within that limit. So you have N collections, and each collection holds a (varying) number of tenants. This also tends to flatten out the update process, on the assumption that your smaller tenants don't update their data as often either.

However, I really have to question one of your basic statements: "This works fine with aggressive autowarming, but I have a need to reduce my NRT search capabilities to seconds as opposed to the minutes it is at now". The implication here is that your autowarming takes minutes. Very often people severely overdo the warmup by setting their autowarm counts to 100s or 1000s. This is rarely necessary, especially if you use docValues fields appropriately. Very often much of the autowarming time is spent "uninverting" fields (look in your Solr log). For any field where you see this happening, turn on docValues and loading will be much faster.
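As a concrete example, here's one way to add such a field with SolrJ's Schema API. Treat it as a sketch: the field name, collection name, and ZooKeeper address are all made up, and editing managed-schema by hand gets you to the same place.

  import java.util.LinkedHashMap;
  import java.util.Map;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.request.schema.SchemaRequest;

  public class AddDocValuesField {
    public static void main(String[] args) throws Exception {
      // Made-up ZooKeeper address; point this at your own cluster.
      CloudSolrClient client = new CloudSolrClient.Builder()
          .withZkHost("localhost:9983").build();

      // A string field with docValues=true: sorting/faceting/grouping
      // on it reads the on-disk docValues structure instead of
      // uninverting the indexed terms in memory at warmup time.
      Map<String, Object> field = new LinkedHashMap<>();
      field.put("name", "category");  // hypothetical field
      field.put("type", "string");
      field.put("stored", false);
      field.put("docValues", true);
      new SchemaRequest.AddField(field).process(client, "collection1");

      client.close();
    }
  }

Remember that switching an existing field to docValues requires a reindex before it does you any good.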
You also haven't said how many documents you have in a shard at present. This is actually the metric I use most often to size hardware. I claim you can find a sweet spot where minimal autowarming will give you good enough performance, and that number is what you should design to. Of course, YMMV.

Finally: push back really hard on how aggressive NRT support needs to be. Often "requirements" like this are made without much thought, as in "faster is better, let's make it 1 second!". There are situations where that's true, but it comes at a cost. Users may be better served by a predictable but fast system than by one that's fast but unpredictable. "Documents may take up to 5 minutes to appear and searches will usually take less than a second" is nice and concise; it sets my expectations. "Documents are searchable in 1 second, but the results may not come back for between 1 and 10 seconds" is much more frustrating.

FWIW,
Erick

On Sat, May 6, 2017 at 5:12 AM, Chris Troullis <cptroul...@gmail.com> wrote:
> Hi,
>
> I use Solr to serve multiple tenants, and currently all tenants' data
> resides in one large collection, with queries carrying a tenant
> identifier. This works fine with aggressive autowarming, but I have a
> need to reduce my NRT search capabilities to seconds as opposed to the
> minutes it is at now, which will mean drastically reducing if not
> eliminating my autowarming. As such I am considering splitting my index
> out by tenant so that when one tenant modifies their data it doesn't
> blow away all of the searcher-based caches for all tenants on soft
> commit.
>
> I have done a lot of research on the subject, and it seems like
> SolrCloud can have problems handling large numbers of collections. I'm
> obviously going to have to run some tests to see how it performs, but
> my main question is this: are there pros and cons to splitting the
> index into multiple collections vs. having one collection split into
> multiple shards? In my case I would have a shard per tenant and use
> implicit routing to route to that specific shard. As I understand it, a
> shard is basically its own Lucene index, so I would still be eating
> that overhead with either approach. What I don't know is whether there
> are any other overheads involved WRT collections vs. shards, routing,
> ZooKeeper, etc.
>
> Thanks,
>
> Chris
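P.S. Since the implicit-routing question above is concrete enough to sketch: here's roughly what the shard-per-tenant mechanics look like in SolrJ. Again a sketch, not gospel; the collection name, configset name, tenant/shard names, and ZooKeeper address are all made up, so check the SolrJ javadocs for your version.

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.request.CollectionAdminRequest;
  import org.apache.solr.common.SolrInputDocument;

  public class ImplicitRoutingSketch {
    public static void main(String[] args) throws Exception {
      // Made-up ZooKeeper address; point this at your own cluster.
      CloudSolrClient client = new CloudSolrClient.Builder()
          .withZkHost("localhost:9983").build();
      client.setDefaultCollection("tenants");

      // One shard per tenant; router.name=implicit means *you* choose
      // the shard for every document instead of hashing on the id.
      CollectionAdminRequest
          .createCollectionWithImplicitRouter("tenants", "myConfigSet",
              "tenantA,tenantB,tenantC", 1)
          .process(client);

      // With the implicit router, the _route_ field names the target
      // shard directly, so this document lands on the tenantA shard.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc1");
      doc.addField("_route_", "tenantA");
      client.add(doc);
      client.commit();

      // _route_ on a query restricts the search to tenantA's shard.
      SolrQuery q = new SolrQuery("*:*");
      q.set("_route_", "tenantA");
      System.out.println(client.query(q).getResults().getNumFound());

      client.close();
    }
  }

Note the commit near the end: even with one tenant per shard, it is still a collection-wide commit that opens new searchers on every shard, which is exactly the behavior you're trying to get away from.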