On 12/19/2012 11:50 AM, Mark wrote:
We have a master with a cron job that runs a delta-import via DIH every 5
minutes. The delta-import takes around 75 minutes to fully complete; most of
that is the optimization after each delta and the slaves then syncing up. Our
index is around 30 gigs, so after a delta-import it takes a few minutes to sync
to each slave, causing a huge increase in disk I/O and slowing the machine to
an unusable state. To get around this we have a rolling upgrade process whereby
one slave at a time takes itself offline, syncs, and then brings itself back
up. Gross… I know. When we want to run a full-import, which could take upwards
of 30 hours, we run it on a separate Solr master while the first Solr master
continues to delta-import. When the staging Solr master is finally done
importing, we copy the index over to the main Solr master, which then syncs up
with the slaves. This has been working for us, but it obviously has its flaws.

I've been looking into completely rewriting our architecture to use SolrCloud
to address some of these pain points, if it makes sense. Please let me know how
Solr 4.0 and SolrCloud could help.

I also have the following questions.
Does DIH work with SolrCloud?
Can SolrCloud utilize the whole cluster to index in parallel, removing the
burden of that task from a single machine? If so, how is it balanced across all
nodes? Can this work with DIH?
When we decide to run a full-import, how can we do it without affecting our
existing cluster, since there is no real master/slave and obviously no staging
"master"?

If the delta-import takes 75 minutes to complete, you should not be kicking a new one off every five minutes. As I understand it, DIH won't run more than one import at a time anyway. If your update code has a lockout mechanism that keeps it from starting a new import until the previous one is done, then you're probably OK scheduling it every five minutes.
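
A rough SolrJ sketch of that lockout, for illustration only: the URL and core name are placeholders, HttpSolrServer is the 4.x class name (older releases call it CommonsHttpSolrServer), and as I recall DIH's status command reports "busy" while an import is running:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.QueryRequest;
  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.common.util.NamedList;

  public class DeltaImportLock {
      public static void main(String[] args) throws Exception {
          SolrServer server = new HttpSolrServer("http://localhost:8983/solr/live");

          // Ask DIH for its current status before starting another import.
          ModifiableSolrParams status = new ModifiableSolrParams();
          status.set("command", "status");
          QueryRequest statusReq = new QueryRequest(status);
          statusReq.setPath("/dataimport");
          NamedList<Object> rsp = server.request(statusReq);

          if ("busy".equals(rsp.get("status"))) {
              return;  // previous import still running, skip this cycle
          }

          // Kick off the delta, explicitly suppressing the optimize.
          ModifiableSolrParams delta = new ModifiableSolrParams();
          delta.set("command", "delta-import");
          delta.set("optimize", "false");
          QueryRequest deltaReq = new QueryRequest(delta);
          deltaReq.setPath("/dataimport");
          server.request(deltaReq);
      }
  }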

Also, you should not be optimizing after every import. Other people on this list will tell you that you should *never* optimize. My opinion is that if you regularly delete or reindex existing documents, you should optimize on a very long interval; if you never delete or reindex documents, optimization is unnecessary. Optimizing is very I/O intensive, as you have likely noticed.
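
If you do keep a periodic optimize, I would trigger it explicitly from a scheduled job rather than as a side effect of an import. A minimal sketch, with a placeholder URL:

  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class WeeklyOptimize {
      public static void main(String[] args) throws Exception {
          // Run this from a weekly cron job, not after every import.
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/live");
          // waitFlush=true, waitSearcher=true, maxSegments=1 (full optimize)
          server.optimize(true, true, 1);
      }
  }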

For really large indexes, you probably shouldn't optimize more than once a week, unless there are a LOT of deleted documents to purge. When you optimize an index after every change, replication has to copy the entire index every time. If you do not optimize, replication can copy only the new (or merged) index files, which is usually very fast.

I believe DIH does work with SolrCloud, but I have never touched SolrCloud, so I can't say for sure. From what I understand, when you send updates to SolrCloud, it farms them out to all replicas simultaneously, and each replica indexes the data independently. The rest of what I'm saying applies to 3.5, the version I currently use in production.
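
For what it's worth, the 4.0 SolrJ client includes a ZooKeeper-aware class that is supposed to handle the routing for you. A minimal sketch, untested by me since I haven't run SolrCloud; the ZK hosts and collection name are placeholders:

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class CloudIndexSketch {
      public static void main(String[] args) throws Exception {
          // Reads cluster state from ZooKeeper and sends updates to a
          // live node, which distributes them to the right replicas.
          CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
          cloud.setDefaultCollection("mycollection");

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "1");
          cloud.add(doc);
          cloud.commit();
      }
  }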

I use DIH for full index rebuilds and a SolrJ application for updates. For every one of my index shards, I actually have two cores: a live core and a build core. I do the full-import into the build cores, and when they all complete, I index differential data into the build cores and then swap live and build. Here is the solr.xml that I use:

http://www.fpaste.org/hWLF/
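
The swap itself is just a CoreAdmin SWAP call. A rough SolrJ sketch, assuming cores named "live" and "build" like mine; the URL is a placeholder:

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.CoreAdminRequest;
  import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

  public class SwapCores {
      public static void main(String[] args) throws Exception {
          // CoreAdmin requests go to the container URL, not a core URL.
          HttpSolrServer admin = new HttpSolrServer("http://localhost:8983/solr");

          CoreAdminRequest swap = new CoreAdminRequest();
          swap.setAction(CoreAdminAction.SWAP);
          swap.setCoreName("build");
          swap.setOtherCoreName("live");
          swap.process(admin);
      }
  }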

You can set up replication so that when you swap cores on the master, the slaves immediately begin a full replication from the new core. I actually no longer use replication, but I once had version 1.4.1 set up this way.
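
If you'd rather not wait out a slave's polling interval after the swap, the replication handler also accepts an explicit fetchindex command. A sketch with a placeholder hostname and core name:

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.QueryRequest;
  import org.apache.solr.common.params.ModifiableSolrParams;

  public class ForceFetch {
      public static void main(String[] args) throws Exception {
          HttpSolrServer slave = new HttpSolrServer("http://slave1:8983/solr/live");

          // Tell the slave's ReplicationHandler to pull the master's index now.
          ModifiableSolrParams p = new ModifiableSolrParams();
          p.set("command", "fetchindex");
          QueryRequest fetch = new QueryRequest(p);
          fetch.setPath("/replication");
          slave.request(fetch);
      }
  }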

Thanks,
Shawn

