My answer remains the same - a large number of collections (cores) in a single Solr instance is not one of the ways in which Solr is designed to scale. To repeat, there are only two ways to scale Solr: number of documents and number of nodes.
-- Jack Krupansky

On Sun, Jun 14, 2015 at 11:00 AM, Shai Erera <ser...@gmail.com> wrote:

> Thanks, Jack, for your response. But I think Arnon's question was
> different.
>
> If you need to index 10,000 different collections of documents in Solr
> (say a collection denotes someone's Dropbox files), then you have two
> options: index all collections in one Solr collection and add a field
> like collectionID to each document and query, or index each user's
> private collection in a different Solr collection.
>
> The pro of the latter is that you don't need to add a collectionID
> filter to each query. Also, from a security/privacy standpoint (and for
> search quality), a user can only ever search what he has access to --
> e.g. he cannot get a spelling correction for words he never saw in his
> own documents, nor document suggestions (even though the 'context' in
> some of the Lucene suggesters allows one to do that too). From a quality
> standpoint, you don't mix different term statistics, etc.
>
> So from a single node's point of view, you can either index 100M
> documents in one index (collection, shard, replica -- whatever -- a
> single Solr core) or in 10,000 such cores. From a node-capacity
> perspective the two are the same -- the same number of documents will be
> indexed overall, the same query workload, etc.
>
> So the question is purely about Solr and its collections management --
> is there anything in that process that can prevent one from managing
> thousands of collections on a single node, or within a single SolrCloud
> instance? If so, what is it -- is it the ZK watchers? Is there a thread
> per collection at work? Something else?
>
> Shai
>
> On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky
> <jack.krupan...@gmail.com> wrote:
>
> > As a general rule, there are only two ways that Solr scales to large
> > numbers: a large number of documents and a moderate number of nodes
> > (shards and replicas). All other parameters should be kept relatively
> > small, like dozens or low hundreds. Even shards and replicas should
> > probably be kept down to that same guidance of dozens or low hundreds.
> >
> > Tens of millions of documents should be no problem. I recommend 100
> > million as the rough limit of documents per node. Of course, it all
> > depends on your particular data model, data, hardware, and network, so
> > that number could be smaller or larger.
> >
> > The main guidance has always been to simply do a proof-of-concept
> > implementation to test for your particular data model and data values.
> >
> > -- Jack Krupansky
> >
> > On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev <arn...@il.ibm.com> wrote:
> >
> > > We're running some tests on Solr and would like to have a deeper
> > > understanding of its limitations.
> > >
> > > Specifically, we have tens of millions of documents (say 50M) and
> > > are comparing several "#collections X #docs_per_collection"
> > > configurations. For example, we could have a single collection with
> > > 50M docs, or 5,000 collections with 10K docs each.
> > > When trying to create the 5,000 collections, we start getting
> > > frequent errors after 1,000-1,500 collections have been created. It
> > > feels like some limit has been reached.
> > > These tests are done on a single node plus an additional node for
> > > replicas.
> > >
> > > Can someone elaborate on what could limit Solr to a high number of
> > > collections (if at all)? That is, if we wanted to have 5K or 10K (or
> > > 100K) collections, is there anything in Solr that would prevent it?
> > > Where would it break?
> > >
> > > Thanks,
> > > Arnon
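[Editor's note] For readers who want to reproduce the kind of test Arnon describes, here is a minimal SolrJ sketch, not taken from the thread. It assumes a 5.x-era SolrJ API (the Collections API request classes were reworked in later versions), a ZooKeeper at localhost:2181, and a shared configset named basic_configs; all of those names are placeholders. It creates single-shard collections in a loop until the Collections API starts reporting failures:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;
    import org.apache.solr.client.solrj.response.CollectionAdminResponse;

    public class ManyCollectionsTest {
      public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper address -- adjust for your cluster.
        try (CloudSolrClient client = new CloudSolrClient("localhost:2181")) {
          for (int i = 0; i < 5000; i++) {
            String name = "coll_" + i;
            // One shard, one replica per collection, all sharing one configset.
            CollectionAdminRequest.Create create = new CollectionAdminRequest.Create();
            create.setCollectionName(name);
            create.setConfigName("basic_configs");
            create.setNumShards(1);
            create.setReplicationFactor(1);
            CollectionAdminResponse rsp = create.process(client);
            if (!rsp.isSuccess()) {
              System.err.println("Creation failed at " + name + ": " + rsp.getErrorMessages());
              break;
            }
          }
        }
      }
    }

Each collection created this way carries its own core, ZooKeeper state, and watchers, which is exactly the per-collection overhead Shai asks about (ZK watchers, a thread per collection), so this loop exercises whatever limit the thread is trying to pin down.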
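[Editor's note] For contrast, here is a sketch of the single-collection alternative Shai outlines: all users' documents live in one collection and every query carries a filter on a collectionID field. The collection URL, field name, and user ID below are hypothetical; the query path is standard SolrJ of the same era:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FilteredUserSearch {
      public static void main(String[] args) throws Exception {
        // Placeholder URL of the single shared collection.
        try (HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/dropbox_docs")) {
          SolrQuery query = new SolrQuery("quarterly report");
          // Restrict the search to one user's documents; every query must carry this filter.
          query.addFilterQuery("collectionID:user_12345");
          QueryResponse rsp = client.query(query);
          System.out.println("Hits for this user: " + rsp.getResults().getNumFound());
        }
      }
    }

As Shai notes, this keeps per-node capacity roughly the same as the many-collections layout, but term statistics, spell-check, and suggesters still see the whole shared index, which is the quality/privacy trade-off discussed above.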