On 7/13/2015 1:49 PM, Raja Pothuganti wrote:
> We are setting up a new SolrCloud environment with 5.2.1 on Ubuntu boxes.
> We currently ingest data into a large collection, call it LIVE. After the
> full ingest is done, we then trigger a delta ingestion every 15 minutes
> to get the documents & data that have changed into this LIVE instance.
>
> In Solr 4.x, using a Master/Slave setup, we had slaves that would
> periodically (weekly or monthly) refresh their data from the Master
> rather than every 15 minutes. We're now trying to figure out how to get
> this same type of setup using SolrCloud.
>
> Question(s):
> - Is there a way to copy data from one SolrCloud collection into another
>   quickly and easily?
> - Is there a way to programmatically control when a replica receives its
>   data, or possibly move it to another collection (without losing data)
>   that updates on a different interval? It ideally would be another
>   collection name, call it Week1 ... Week52, to avoid a replica in the
>   same collection serving old data.
>
> One option we thought of was to create a backup and then restore that
> into a new, clean cloud. This has a lot of moving parts and isn't nearly
> as neat as the Master/Slave controlled-replication setup. It also has the
> side effect of potentially taking a very long time to back up and restore
> instead of just copying the indexes like the old M/S setup.
SolrCloud works very differently from replication. When you send an
indexing request, the documents are forwarded to the leader replica of the
shard that will index them. The leader indexes the documents locally and
sends a copy to all other replicas, each of which independently indexes
those documents. There's no need to copy finished indexes (or even index
segments) around -- each shard replica builds itself incrementally, in
parallel with the others, as you index new documents. There is no polling
interval -- replicas change at nearly the same time when you do an index
update.

Rather than separate collections for each week, you might want to consider
using the implicit router on a single collection and creating a new *shard*
for each week. This would be done with the CREATESHARD action on the
Collections API. The implicit router does create a new wrinkle for
indexing -- you cannot simply index to the entire collection ... you must
specifically index to one of the replicas for that specific shard. There
might be some way to indicate on the update request which shard it should
go to, but I haven't examined SolrCloud requests in that much detail.

As for copying indexes ... the newest versions of Solr include a
backup/restore API, but if your indexes are very large, this will be quite
slow.

TL;DR info: With enough digging, you will learn that SolrCloud *does*
require a replication handler, which might be very confusing, since I've
just told you that it's very different from replication. That handler is
*only* used when a replica requires recovery. Recovery might be required
because a replica has been down too long, has been newly created, or some
similar situation. It is NOT used during normal SolrCloud operation.

"Collections are made up of one or more shards. Shards have one or more
replicas. Each replica is a core."

https://cwiki.apache.org/confluence/display/solr/How+SolrCloud+Works

There's a lot of info in a small space here.
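To make the shard-per-week idea concrete, here's a rough sketch of the two
HTTP requests involved. The host, port, collection, and shard names are
just examples, and -- if I'm reading the reference guide right -- the
_route_ update parameter is the mechanism for telling the implicit router
which shard an update belongs to, so take that part with a grain of salt
and verify it against the docs:

```python
from urllib.parse import urlencode

# Assumed Solr location, for illustration only.
SOLR = "http://localhost:8983/solr"

def createshard_url(collection: str, shard: str) -> str:
    # Collections API request that adds a named shard (e.g. "week27")
    # to a collection created with router.name=implicit.
    params = urlencode({"action": "CREATESHARD",
                        "collection": collection,
                        "shard": shard})
    return f"{SOLR}/admin/collections?{params}"

def routed_update_url(collection: str, shard: str) -> str:
    # Update request aimed at one specific shard via the _route_
    # parameter, which the implicit router uses to pick the shard.
    return f"{SOLR}/{collection}/update?{urlencode({'_route_': shard})}"

print(createshard_url("LIVE", "week27"))
# http://localhost:8983/solr/admin/collections?action=CREATESHARD&collection=LIVE&shard=week27
print(routed_update_url("LIVE", "week27"))
# http://localhost:8983/solr/LIVE/update?_route_=week27
```

You'd POST your documents to the second URL; once a week's shard goes
stale, DELETESHARD (same API) can drop it without touching the rest of
the collection.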
Hopefully it's enough for you to find more detail in the Solr
documentation, the wiki, or possibly other locations.

Thanks,
Shawn