One other option is to index "somewhere else", then use the Collections API ADDREPLICA command to create replicas on your prod cluster. Then DELETEREPLICA the copies on the nodes that are "somewhere else".
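An untested sketch of that, via the Collections API (collection, shard, and
node names below are placeholders for your setup; repeat per shard):

  # Pull a copy of the shard onto one of your prod nodes
  curl 'http://prodhost:8983/solr/admin/collections?action=ADDREPLICA&collection=coll3&shard=shard1&node=prodhost:8983_solr'

  # Once the new replica is active, drop the one on the "somewhere else" node.
  # The replica name (e.g. core_node2) is visible via action=CLUSTERSTATUS.
  curl 'http://prodhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=coll3&shard=shard1&replica=core_node2'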
Best,
Erick

On Jun 21, 2016 4:27 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:

There’s no official way of doing #1, but there are some less official ways:

1. The Backup/Restore API provides some hooks into loading pre-existing data
   dirs into an existing collection. Lots of caveats.
2. If you don’t have many shards, there’s always rsync/reload.
3. There are some third-party tools that help with this kind of thing:
   a. https://github.com/whitepages/solrcloud_manager (primarily a command line tool)
   b. https://github.com/bloomreach/solrcloud-haft (primarily a library)

For #2, absolutely. Spin up some new nodes in your cluster, and then use the
“createNodeSet” parameter when creating the new collection to restrict it to
those new nodes (a sketch follows the quoted message below):
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1

On 6/21/16, 12:33 PM, "Kelly, Frank" <frank.ke...@here.com> wrote:

>We have about 200 million documents (~70 GB) we need to keep indexed
>across 3 collections.
>
>Currently 2 of the 3 collections are already indexed (roughly 90M docs).
>
>We’d like to create the remaining collection (about 100M documents) while
>minimizing the performance impact on the existing collections on the Solr
>servers during that time.
>
>Is there some way to do this, either by
>
> 1. Creating the collection in another environment and shipping the
>    (underlying Lucene) index files
> 2. Creating the collection on (dedicated) new machines that we add to the
>    SolrCloud cluster?
>
>Thoughts, comments or suggestions appreciated,
>
>Best
>
>-Frank Kelly
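A rough sketch of the createNodeSet approach Jeff describes above, assuming
two freshly added nodes (host names, ports, shard counts, and the config name
are placeholders):

  # Create the new collection only on the new nodes, so the heavy initial
  # indexing stays off the machines serving the existing collections
  curl 'http://prodhost:8983/solr/admin/collections?action=CREATE&name=coll3&numShards=2&replicationFactor=1&collection.configName=coll3_conf&createNodeSet=newhost1:8983_solr,newhost2:8983_solr'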