Hi Solr community

I'm in the process of getting my head around SolrCloud; more specifically, 
trying to design a feasible workflow for a use case where we currently use 
master/slave replication. First, the use case:

We want to
  1. separate indexing workload from query workload
  2. deploy config and/or schema changes without interrupting queries

Currently we do (1) with a straightforward master/slave replication setup: N 
master shards that handle updates and N slave shards replicating from them. In 
this setup we can do (2) by temporarily stopping replication, deploying the new 
configuration/schema to the master shards, possibly re-indexing, switching 
queries to go to the master shards, re-enabling replication, and - when 
replication has finished - switching queries back to the slave shards.
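
For context, the "temporarily stopping replication" step is just the slaves' 
replication handler being told to stop polling the masters, and enabling the 
polling again afterwards. A minimal sketch of that part (host and core names 
are placeholders for our actual setup):

# Toggle slave polling around a config deployment on the masters.
import requests

SLAVES = ["http://slave1:8983", "http://slave2:8983"]  # placeholder hosts
CORE = "docs"                                          # placeholder core name

def replication_command(base_url, command):
    """Send a command to a slave core's replication handler."""
    resp = requests.get(
        f"{base_url}/solr/{CORE}/replication",
        params={"command": command, "wt": "json"},
    )
    resp.raise_for_status()
    return resp.json()

# 1. Stop the slaves from polling the masters.
for slave in SLAVES:
    replication_command(slave, "disablepoll")

# ... deploy config/schema to the masters, reload, re-index,
#     and point queries at the masters in the meantime ...

# 2. Re-enable polling; the slaves pick up the new index on the next poll.
for slave in SLAVES:
    replication_command(slave, "enablepoll")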

So... introducing SolrCloud. We would really like to use SolrCloud, 
especially for the added fault tolerance and simpler distributed indexing, but 
I'm a bit puzzled about how to achieve something similar to the above.

Re (1): Am I right in thinking that a given update is sent to every replica of 
the shard to which it belongs for analysis and indexing? And that there is no 
immediate way to separate indexing from queries within a collection? 

Re (2): Deploying a new schema/config should be as simple as uploading it to 
ZooKeeper and reloading the cores. Right? So for the case where the new 
config/schema is compatible with the existing index, we're good. For the other 
case, I think we could do it by creating a new collection, uploading the new 
config/schema to ZooKeeper, indexing into the new collection, switching queries 
to the new collection, and deleting the old collection. Would this be the way 
to go? Or is there a simpler way that I cannot see?
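
To make the question concrete, here is roughly how I picture the two cases 
against the Collections API (collection/config names are placeholders, and I'm 
assuming the config set has already been uploaded to ZooKeeper, e.g. with 
zkcli.sh upconfig):

# Rough sketch of the two deployment cases as I understand them.
import requests

SOLR = "http://solr1:8983/solr"  # placeholder node

def collections_api(action, params=None):
    """Call the Collections API with the given action and parameters."""
    query = {"action": action, "wt": "json"}
    query.update(params or {})
    resp = requests.get(f"{SOLR}/admin/collections", params=query)
    resp.raise_for_status()
    return resp.json()

# Case 1: the new config/schema is compatible with the existing index,
# so reloading the collection should be enough.
collections_api("RELOAD", {"name": "docs"})

# Case 2: incompatible change -> create a new collection on the new config
# set, re-index into it, switch queries over, then drop the old one.
collections_api("CREATE", {
    "name": "docs_v2",
    "numShards": 4,               # placeholder shard/replica counts
    "replicationFactor": 2,
    "collection.configName": "docs_v2",
})
# ... re-index everything into docs_v2 and switch queries to it ...
collections_api("DELETE", {"name": "docs"})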


Just to bring the scale of our operation into it: our index is approx. 200 
million documents, with a total index size of around 0.5TB. The normal flow of 
updates is on the order of a few million per day, but we will frequently (say 
on a weekly basis) need to re-index all or large parts of our documents, either 
due to schema changes or re-processing of the original data.


Sorry for dumping my brain on you, but any input you might have on this will 
be highly appreciated.

Regards, 

-- 
Steffen Elberg Godskesen
Programmer
DTU Library
---------------------------------------
Technical University of Denmark
Technical Information Center of Denmark
Anker Engelunds Vej 1
PO Box 777
Building 101D
2800 Kgs. Lyngby
s...@dtic.dtu.dk
http://www.dtic.dtu.dk/


