Hi Shushuai,
Yes, as Robi noted, you have to be careful with terminology: core
generally refers to the traditional Solr configuration of a single index
+ configuration on a single node (optionally replicated to others). A
collection is a distributed index that is associated with a
configuration (but multiple collections can be associated with the same
configuration).
A collection is still a single index, however, just like a core - it's
just spread out across however many nodes you have and replicated
according to your chosen replication factor. You can do multi-tenancy
with cores and collections, but via different strategies.
More inline ...
On 15/03/2014 19:17, shushuai zhu wrote:
Hi Lajos, thanks again.
Your suggestion is to support multi-tenant via collection in a Solr Cloud:
putting small tenants in one collection and big tenants in their own
collections.
My original question was to find out which approach is better: supporting
multi-tenancy at the collection level or at the core level. Based on the links
below and a few comments there, it seems people prefer the core level. A
collection is logical and a core is physical. I am trying to figure out the
trade-offs between the approaches with regard to scalability, security,
performance, and flexibility. My understanding might be wrong; below is a rough
comparison:
1) Scalability
Core is more scalable than collection by number: can we have many more cores
than collections in one Solr Cloud? Or is collection more scalable than core by
size: could a collection be much bigger than a core? Not sure which one is
better: having ~1000 cores or ~1000 collections in a Solr Cloud.
SolrCloud is more scalable in terms of index size. Plus you get
redundancy, whose value shouldn't be underestimated in a hosted solution.
2) Security
Core is more isolated than collection: core is physical and has its own index,
but collection is logical so multiple collections may contain the same cores?
No: cores are neither more nor less isolated than collections. Both
support multi-tenancy, albeit in different ways. If you do it within a
single core using some prefix or special field, you just have to be
aware of the security implications. As Robi said, this is easily
enforced by the middle tier; I use Spring for this in my case.
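To illustrate the middle-tier enforcement mentioned above, here is a minimal sketch in Python (not the Spring setup I actually use). It assumes each document carries a tenant field; the field name `tenant_id` and the function name are hypothetical. The point is that the tenant restriction goes into a separate `fq` that the middle tier adds itself, so the user-supplied query string never controls it.

```python
from urllib.parse import urlencode


def build_tenant_query(user_query, tenant_id, tenant_field="tenant_id"):
    """Return Solr request parameters with a mandatory tenant filter.

    tenant_field is a hypothetical field name; use whatever field your
    schema stores the tenant identifier in.
    """
    # Escape backslashes and double quotes so the tenant id cannot
    # break out of the quoted phrase in the filter query.
    safe_id = tenant_id.replace("\\", "\\\\").replace('"', '\\"')
    return [
        ("q", user_query),                      # whatever the tenant's user typed
        ("fq", f'{tenant_field}:"{safe_id}"'),  # added by the middle tier only
    ]


params = build_tenant_query("title:solr", "acme")
query_string = urlencode(params)
print(query_string)
```

The tenant id should come from the authenticated session in the middle tier, never from the request itself, and Solr should not be reachable directly by end users.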
3) Performance
Core has better performance control since it has its own index? Collection
index is bigger so performance is not as good as smaller core index?
Not really. You might want to test this, however, to verify with your
specific hardware configuration.
4) Flexibility
Core is more flexible since it has its own schema/config, but one collection
may have multiple cores hence multiple schemas/configs? Or it does not matter
since we can set same schema/config for the whole collection?
One could argue that the easiest configuration will be one big
collection (or maybe divided up intelligently amongst several big
collections). More complex is 1000s of cores or collections.
The issue is management. 1000s of cores/collections require a level of
automation. On the other hand, having a single core/collection means if
you make one change to the schema or solrconfig, it affects everyone.
That might not work if you have frequent changes or differing tenant needs.
This is a decision you'll have to make yourself, based on your client
needs, change management, index sizes, management system, etc, etc.
Regards,
Lajos
Basically, I just want to get opinions about which approach might be better for
the given use case.
Regards.
Shushuai
________________________________
From: Lajos <la...@protulae.com>
To: solr-user@lucene.apache.org
Sent: Saturday, March 15, 2014 1:19 PM
Subject: Re: Best practice to support multi-tenant with Solr
Hi Shushuai,
---------------------------
Finally, I would (in general) argue for cloud-based implementations to give you
data redundancy ...
---------------------------
Do you mean using multi-sharding to have multiple replicas of cores
(corresponding to tenants) across nodes?
Shushuai
What I mean first and foremost is that using SolrCloud with replication
ensures that your data isn't lost if you lose a node. So in a hosted
solution, that's a good thing.
If you are using SolrCloud, then it's up to you to choose whether to have
one collection per tenant, or one collection that supports multiple
tenants via document routing.
Obviously the former has implications for the number of cores you'll
have. For example, if each collection has 3 shards (one per node in a
3-node cluster) and a replication factor of 2, that's 6 cores (shard
replicas) per collection. If you have 1,000 tenant collections, that's
6,000 cores. Hence my argument for multiple low-end tenants per
collection, and then only give your higher-end tenants their own
collections. Just to make things simpler for you ;)
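The arithmetic above can be sketched in a few lines of Python. This assumes every collection is created with the same number of shards (3 here, one per node) and the same replication factor; the function name is just for illustration.

```python
def cores_needed(num_shards, replication_factor, num_collections):
    """Return (cores per collection, cores across all collections)."""
    # Every shard is hosted replication_factor times, and each hosted
    # copy is one core on some node.
    per_collection = num_shards * replication_factor
    return per_collection, per_collection * num_collections


per_collection, total = cores_needed(
    num_shards=3, replication_factor=2, num_collections=1000
)
print(per_collection, total)
```

Spread over 3 nodes, that total works out to roughly 2,000 cores per node, which is why the one-collection-per-tenant layout gets heavy quickly.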
Regards,
Lajos
________________________________
From: Lajos <la...@protulae.com>
To: solr-user@lucene.apache.org
Sent: Saturday, March 15, 2014 5:37 AM
Subject: Re: Best practice to support multi-tenant with Solr
Hi Shushuai,
Just a few thoughts.
I would guess that most people would argue for implementing
multi-tenancy within your core (via some unique filter ID) or collection
(via document routing) because of the headache of managing individual
cores at the scale you are talking about.
There are disadvantages the other way too: having a core/collection
support multiple tenants does affect scoring, since TF-IDF is calculated
across the index, and can open up security implications that you have to
address (i.e. making sure a malicious query cannot get another tenant's
documents).
The most important thing you have to lock down is whether there is a
need to customize the schema/solrconfig for each tenant. If there is,
then having individual cores per tenant is going to be a stronger
argument. If I were to guess, and based on my own multi-tenant
experience, you'll have some high-end tenants who need their own
cores/collections, and a larger number that can all share a
configuration. It's like any kind of hosted solution: the cheapest
version is one-size-fits-all and involves the minimum of management
overhead, while the higher-end versions are more expensive and require
more management.
My own preference is for a blended environment. While the management of
individual cores/collections is not to be taken lightly, I've done it in
a variety of hosting situations and it all comes down to smart
management and the intelligent use of administrative scripts. I've
developed my own set of tools over the years and they work quite well.
Finally, I would (in general) argue for cloud-based implementations to
give you data redundancy, but that decision would require more information.
HTH,
Lajos Moczar
theconsultantcto.com
Enterprise Lucene/Solr
On 14/03/2014 23:10, shushuai zhu wrote:
Hi,
I am looking into Solr 4.7 for best practice of multi-tenancy support. Our use
cases require support of thousands of tenants (say 10,000) and the incoming
data rate could be more than 10k documents per second. I did some research and
found people talked about scaling tenants at all four levels:
Solr Cloud
Collection
Shard
Core
I am listing them plus some quoted comments from the links.
1) Solr Cloud and Collection
http://find.searchhub.org/document/c7caa34d807a8a1b#c7caa34d807a8a1b
-----------
Are you trying to do "multi-tenant"? If so, you should be talking
"multi-cluster" where you externally manage your "tenants",
assigning them to clusters, but keeping tenants per cluster down in
the dozens/hundreds, and "archiving" inactive tenants and spinning
up (and down) clusters as inactive tenants become active or fall
into inactivity. But keeping 1,000 or more tenants active in a
single cluster as separate collections is... a no-go.
-----------
2) Shard
http://searchhub.org/2013/06/13/solr-cloud-document-routing/
-----------
Document routing can be used to achieve a more efficient
multi-tenant environment. This can be done by making the tenant id
the shard key, which would group all documents from the same tenant
on the same shard.
-----------
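The routing idea quoted above can be sketched as follows, assuming the collection uses Solr's compositeId router, where a document id of the form `tenant!docid` makes the tenant prefix the shard key. The helper names and the `tenant_id` field are hypothetical. The routing parameter is assumed to be `_route_`; older releases and the linked article use `shard.keys`, so check which one your version expects.

```python
def routed_id(tenant_id, doc_id):
    """Build a compositeId-style document id: '<tenant>!<doc id>'."""
    # Restricting tenant ids to alphanumerics keeps '!' and query
    # syntax characters out of the shard key.
    if not tenant_id.isalnum():
        raise ValueError("tenant id must be alphanumeric")
    return f"{tenant_id}!{doc_id}"


def routed_query_params(tenant_id, user_query, route_param="_route_"):
    """Build query parameters that target only the tenant's shard."""
    if not tenant_id.isalnum():
        raise ValueError("tenant id must be alphanumeric")
    return {
        "q": user_query,
        # Limits which shard(s) are searched for this request.
        route_param: f"{tenant_id}!",
        # Still needed: other tenants can hash to the same shard.
        "fq": f'tenant_id:"{tenant_id}"',
    }


doc_id = routed_id("tenant1", "doc42")
params = routed_query_params("tenant1", "*:*")
print(doc_id, params)
```

Routing only narrows where the query runs; it does not by itself hide other tenants' documents that live on the same shard, hence the extra filter.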
3) Core
http://find.searchhub.org/document/4312991db2dd90e9#4312991db2dd90e9
-----------
Every multitenant situation is going to be different, but at the
extreme a single core per tenant is the cleanest and provides the
best separation, optimal performance, and supports full tf-idf
relevancy of document fields for each tenant.
-----------
http://find.searchhub.org/document/fc5b734fba135e83#fc5b734fba135e83
-----------
Well, we try to use Solr to run a multi-tenant index/search
service. We assigns each client a different core with their own
config and schema. It would be good for us if we can just let the
customer to be able to create cores with their own schema and
config.
-----------
I also saw slides talking about scaling time along Collection: timed
collections (slides 50 ~ 58)
http://www.slideshare.net/sematext/solr-for-indexing-and-searching-logs
According to these, I am thinking about the following approach:
In a single Solr Cloud, the multi-tenant support is at Core level
(one or more cores per tenant), and for better performance, will
create a collection every day. When a tenant grows too big, will
migrate it from this Solr cloud to a new Solr Cloud.
Any potential issue with this approach? Is there better approach
based on your experience?
A few questions related to proposed approach:
1) When a core is replicated to multiple nodes via multiple shards,
a query submitted against a particular core (tenant) should be
executed as a distributed query, right?
2) What is the best way to move a core from one Solr Cloud to
another?
3) If we create one collection per day and want to keep data for
three years for example, is it OK to have so many collections? If
yes, is it cheap to maintain the collection alias for easy querying?
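For question 3, a rough sketch of what maintaining such an alias could look like, assuming daily collections named `<prefix>_YYYYMMDD` (a hypothetical naming scheme) and the Collections API `CREATEALIAS` action. The base URL, prefix, and alias name are placeholders; the code only builds the request URL and does not contact Solr.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def daily_collection_names(prefix, last_day, days):
    """Names of the daily collections for the `days` days ending at last_day."""
    return [
        f"{prefix}_{(last_day - timedelta(days=i)).strftime('%Y%m%d')}"
        for i in range(days)
    ]


def create_alias_url(base_url, alias, collections):
    """URL that (re)points `alias` at the given list of collections."""
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{base_url}/admin/collections?{params}"


names = daily_collection_names("logs", date(2014, 3, 15), 3)
url = create_alias_url("http://localhost:8983/solr", "logs_recent", names)
print(url)
```

Re-issuing the call each day with the updated list moves the alias forward, so queries can keep using the single alias name.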
Thanks.
Shushuai