We're currently running Solr 3.5 and our indexing process works as follows:
We have a master with a cron job that runs a delta-import via DIH every 5 minutes. The delta-import takes around 75 minutes to fully complete; most of that time is spent on the optimize that runs after each delta, followed by the slaves syncing up. Our index is around 30 GB, so after a delta-import it takes a few minutes to sync to each slave, and the sync causes a huge increase in disk I/O that slows the machine down to an unusable state. To get around this we have a rolling upgrade process where one slave at a time takes itself offline, syncs, and then brings itself back up. Gross... I know.

When we want to run a full-import, which can take upwards of 30 hours, we run it on a separate Solr master while the first Solr master continues to delta-import. When the staging Solr master is finally done importing, we copy the index over to the main Solr master, which then syncs up with the slaves.

This has been working for us, but it obviously has its flaws. I've been looking into completely rewriting our architecture to use SolrCloud to help with some of these pain points, if it makes sense. Please let me know how Solr 4.0 and SolrCloud could help.

I also have the following questions:

1. Does DIH work with SolrCloud?
2. Can SolrCloud use the whole cluster to index in parallel, removing the burden of that task from a single machine? If so, how is the work balanced across all nodes? Can this work with DIH?
3. When we decide to run a full-import, how can we do this without affecting our existing cluster, since there is no real master/slave and obviously no staging "master"?

Thanks in advance!

- M