On 12/19/2012 11:50 AM, Mark wrote:
We have a master that has a cron job to run a delta-import via DIH every 5
minutes. The delta-import takes around 75 minutes to fully complete; most of
that time goes to the optimization after each delta and the slaves syncing up
afterward. Our index is around 30 gigs, so after each delta-import it takes a
few minutes to sync to each slave, which causes a huge increase in disk I/O
and slows the machine to an unusable state. To get around this we have a
rolling upgrade process whereby one slave at a time takes itself offline,
syncs, and then brings itself back up. Gross… I know. When we want to run a
full-import, which could take upwards of 30 hours, we run it on a separate
Solr master while the first Solr master continues to delta-import. When the
staging Solr master is finally done importing, we copy the index over to the
main Solr master, which then syncs up with the slaves. This has been working
for us, but it obviously has its flaws.
I've been looking into completely rewriting our architecture to utilize
SolrCloud to help us with some of these pain points, if it makes sense.
Please let me know how Solr 4.0 and SolrCloud could help.
I also have the following questions.
Does DIH work with SolrCloud?
Can SolrCloud utilize the whole cluster to index in parallel, removing the
burden of that task from a single machine? If so, how is the work balanced
across all nodes? Can this work with DIH?
When we decide to run a full-import, how can we do this without affecting
our existing cluster, since there is no real master/slave and obviously no
staging "master"?
If the delta-import takes 75 minutes to complete, you should not be
doing it every five minutes. As I understand it, DIH won't do more than
one import at the same time anyway. If your update code has a lockout
mechanism that will keep it from trying a new import until a previous
one is done, then you're probably OK kicking it off every five minutes.
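As a sketch (not tested against your setup), a wrapper like this could be
what cron runs instead of hitting DIH directly; the host, port, and core
name are assumptions:

import json
import urllib.request

# Hypothetical DIH handler URL; adjust host/core to match your setup.
DIH = "http://localhost:8983/solr/live/dataimport"

def dih_idle():
    # DIH's status command reports "busy" while an import is in progress.
    with urllib.request.urlopen(DIH + "?command=status&wt=json") as resp:
        return json.load(resp).get("status") == "idle"

if dih_idle():
    # delta-import defaults to clean=false, so existing docs are kept.
    urllib.request.urlopen(DIH + "?command=delta-import")
else:
    print("Previous import still running; skipping this cycle.")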
Also, you should not be optimizing after every import. Other people on
this list will tell you that you should *never* optimize. My opinion on
it is that if you delete or reindex existing documents regularly, you
should optimize on a very long interval. If you never delete or reindex
documents, then optimization is unnecessary. Optimization is very I/O
intensive, as you have likely noticed.
For really large indexes, you probably shouldn't optimize more than once
a week, unless there are a LOT of deleted documents to purge. When you
optimize an index after every change and it has to be replicated, the
entire index will be copied every time. If you do not optimize your
index, then replication can copy only the new (or merged) index files,
which is usually very very fast.
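If you do keep an occasional optimize, it can be an explicit, separate step
(say, from a weekly cron job) rather than part of every import. A minimal
sketch, with the host and core name as assumptions:

import urllib.request

# Post the standard <optimize/> update message to the core's /update handler.
req = urllib.request.Request(
    "http://localhost:8983/solr/live/update",
    data=b"<optimize/>",
    headers={"Content-Type": "text/xml"},
)
urllib.request.urlopen(req)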
I believe that DIH does work with SolrCloud, but I have never touched
SolrCloud, so I can't say for sure. From what I understand, if you send
updates to SolrCloud, it will farm those out to all replicas
simultaneously, and those replicas will each index the data
independently. The rest of what I am saying will be for 3.5, which is
the version that I currently use in production.
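If my understanding is right, indexing to SolrCloud should look much like
indexing to any single node - you post documents to one server and the
cluster handles distribution. A hedged sketch (the node address and
collection name are made up for illustration):

import urllib.request

# Add one document by posting a standard XML update to any node; in SolrCloud
# the receiving node is supposed to forward it to the right replicas.
doc = (b'<add><doc>'
       b'<field name="id">1</field>'
       b'<field name="title">test</field>'
       b'</doc></add>')
req = urllib.request.Request(
    "http://node1:8983/solr/collection1/update?commit=true",
    data=doc,
    headers={"Content-Type": "text/xml"},
)
urllib.request.urlopen(req)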
I use DIH for full index rebuilds and a SolrJ application for updates.
For every one of my index shards, I actually have two cores - a live
core and a build core. I do the full-import to the build core, and when
they all complete, I index differential data to the build cores, then
swap live and build. Here is the solr.xml that I use:
http://www.fpaste.org/hWLF/
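The swap step itself is just a CoreAdmin call; a sketch, assuming the two
cores are named "live" and "build":

import urllib.request

# SWAP atomically exchanges the two core names, so "live" now serves the
# freshly built index and "build" holds the old one.
urllib.request.urlopen(
    "http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=build"
)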
You can set up replication such that when you swap cores on the master,
the slaves will immediately begin a full replication from the new core.
I actually no longer use replication, but once had version 1.4.1 set up
this way.
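For what it's worth, the slaves notice the new index on their next poll
after the swap; you can also force an immediate fetch through the
replication handler. A sketch, with the slave host names as placeholders:

import urllib.request

# Ask each slave's replication handler to pull the master's index right away
# instead of waiting for its next poll interval.
for slave in ("slave1", "slave2"):
    urllib.request.urlopen(
        "http://" + slave + ":8983/solr/live/replication?command=fetchindex"
    )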
Thanks,
Shawn