We're currently running Solr 3.5 and our indexing process works as follows:  

We have a master with a cron job that runs a delta-import via DIH every 5 
minutes. The delta-import takes around 75 minutes to fully complete; most of 
that time goes to the optimize that runs after each delta, and then the slaves 
sync up. Our index is around 30 GB, so after each delta-import it takes a few 
minutes to sync to each slave. The sync causes a huge increase in disk I/O, 
which slows the machine down to an unusable state. To get around this we have 
a rolling upgrade process in which one slave at a time takes itself offline, 
syncs, and then brings itself back up. Gross… I know.

When we want to run a full-import, which can take upwards of 30 hours, we run 
it on a separate staging Solr master while the main Solr master continues to 
delta-import. When the staging master finally finishes importing, we copy its 
index over to the main master, which then syncs with the slaves. This has 
been working for us, but it obviously has its flaws.
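
To make the current flow concrete, here is a minimal Python sketch of the 
delta-import call and the one-slave-at-a-time sync. It is an illustration 
only, not our actual scripts: the host names, the /dataimport and 
/replication handler paths, and the take_offline / bring_online steps are 
assumptions standing in for whatever the cron job and load balancer really do.

```python
from urllib.parse import urlencode


def delta_import_url(master, optimize=True):
    """Build the DIH delta-import URL the cron job would request.

    Assumes DIH is registered at /dataimport on the master core.
    """
    params = {"command": "delta-import", "optimize": str(optimize).lower()}
    return f"{master}/dataimport?{urlencode(params)}"


def rolling_sync_plan(slaves):
    """Return the ordered steps for syncing slaves one at a time.

    take_offline / bring_online are hypothetical placeholders for removing a
    slave from rotation and adding it back; fetch_index points at the
    replication handler, assumed to live at /replication.
    """
    plan = []
    for slave in slaves:
        plan.append(("take_offline", slave))
        plan.append(("fetch_index", f"{slave}/replication?command=fetchindex"))
        plan.append(("bring_online", slave))
    return plan


if __name__ == "__main__":
    print(delta_import_url("http://master:8983/solr"))
    for step, target in rolling_sync_plan(
        ["http://slave1:8983/solr", "http://slave2:8983/solr"]
    ):
        print(step, target)
```

Only one slave is out of rotation at any point in the plan, which is what 
keeps the cluster serving queries while the disk I/O spike happens.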

I've been looking into completely rewriting our architecture to use 
SolrCloud to help with some of these pain points, if it makes sense. Please 
let me know how Solr 4.0 and SolrCloud could help.

I also have the following questions:

1. Does DIH work with SolrCloud?
2. Can SolrCloud use the whole cluster to index in parallel, so that the 
   burden no longer falls on a single machine? If so, how is the work 
   balanced across the nodes? Can this work with DIH?
3. When we decide to run a full-import, how can we do this without affecting 
   our existing cluster, given that there is no real master/slave and 
   obviously no staging "master"?

Thanks in advance!

- M
