On 12/19/2012 11:50 AM, Mark wrote:
We have a master that has a cron job to run a delta-import via DIH every 5
minutes. The delta-import takes around 75 minutes to fully complete; most of
that time goes to the optimization after each delta and the slaves syncing up
afterward. Our index is around 30 gigs, so after each delta-import it takes a
few minutes to sync to each slave, which causes a huge increase in disk I/O
and slows the machine to an unusable state. To get around this we have a
rolling upgrade process whereby one slave at a time takes itself offline,
syncs, and then brings itself back up. Gross… I know. When we want to run a
full-import, which could take upwards of 30 hours, we run it on a separate
Solr master while the first Solr master continues to delta-import. When the
staging Solr master is finally done importing, we copy the index over to the
main Solr master, which then syncs up with the slaves. This has been working
for us, but it obviously has its flaws.
I've been looking into completely rewriting our architecture to utilize
SolrCloud to help us with some of these pain points, if it makes sense.
Please let me know how Solr 4.0 and SolrCloud could help.
I also have the following questions.
Does DIH work with SolrCloud?
Can SolrCloud utilize the whole cluster to index in parallel, removing the
burden of that task from a single machine? If so, how is the work balanced
across all nodes? Can this work with DIH?
When we decide to run a full-import, how can we do this without affecting
our existing cluster, since there is no real master/slave and obviously no
staging "master"?
If the delta-import takes 75 minutes to complete, you should not be
doing it every five minutes. As I understand it, DIH won't do more than
one import at the same time anyway. If your update code has a lockout
mechanism that will keep it from trying a new import until a previous
one is done, then you're probably OK kicking it off every five minutes.
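As a sketch (not tested against your setup), a wrapper like this could be
what cron runs instead of hitting DIH directly; the host, port, and core
name are assumptions:

import json
import urllib.request

# Hypothetical DIH handler URL; adjust host/core to match your setup.
DIH = "http://localhost:8983/solr/live/dataimport"

def dih_idle():
    # DIH's status command reports "busy" while an import is in progress.
    with urllib.request.urlopen(DIH + "?command=status&wt=json") as resp:
        return json.load(resp).get("status") == "idle"

if dih_idle():
    # delta-import defaults to clean=false, so existing docs are kept.
    urllib.request.urlopen(DIH + "?command=delta-import")
else:
    print("Previous import still running; skipping this cycle.")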
Also, you should not be optimizing after every import. Other people on
this list will tell you that you should *never* optimize. My opinion on
it is that if you delete or reindex existing documents regularly, you
should optimize on a very long interval. If you never delete or reindex
documents, then optimization is unnecessary. Optimization is very I/O
intensive, as you have likely noticed.
For really large indexes, you probably shouldn't optimize more than once
a week, unless there are a LOT of deleted documents to purge. When you
optimize an index after every change and it has to be replicated, the
entire index will be copied every time. If you do not optimize your
index, then replication can copy only the new (or merged) index files,
which is usually very very fast.
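If you do keep an occasional optimize, it can be an explicit, separate step
(say, from a weekly cron job) rather than part of every import. A minimal
sketch, with the host and core name as assumptions:

import urllib.request

# Post the standard <optimize/> update message to the core's /update handler.
req = urllib.request.Request(
    "http://localhost:8983/solr/live/update",
    data=b"<optimize/>",
    headers={"Content-Type": "text/xml"},
)
urllib.request.urlopen(req)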
I believe that DIH does work with SolrCloud, but I have never touched
SolrCloud, so I can't say for sure. From what I understand, if you send
updates to SolrCloud, it will farm those out to all replicas
simultaneously, and those replicas will each index the data
independently. The rest of what I am saying will be for 3.5, which is
the version that I currently use in production.
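If my understanding is right, indexing to SolrCloud should look much like
indexing to any single node - you post documents to one server and the
cluster handles distribution. A hedged sketch (the node address and
collection name are made up for illustration):

import urllib.request

# Add one document by posting a standard XML update to any node; in SolrCloud
# the receiving node is supposed to forward it to the right replicas.
doc = (b'<add><doc>'
       b'<field name="id">1</field>'
       b'<field name="title">test</field>'
       b'</doc></add>')
req = urllib.request.Request(
    "http://node1:8983/solr/collection1/update?commit=true",
    data=doc,
    headers={"Content-Type": "text/xml"},
)
urllib.request.urlopen(req)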
I use DIH for full index rebuilds and a SolrJ application for updates.
For every one of my index shards, I actually have two cores - a live
core and a build core. I do the full-import to the build core, and when
they all complete, I index differential data to the build cores, then
swap live and build. Here is the solr.xml that I use:
http://www.fpaste.org/hWLF/
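The swap step itself is just a CoreAdmin call; a sketch, assuming the two
cores are named "live" and "build":

import urllib.request

# SWAP atomically exchanges the two core names, so "live" now serves the
# freshly built index and "build" holds the old one.
urllib.request.urlopen(
    "http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=build"
)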
You can set up replication such that when you swap cores on the master,
the slaves will immediately begin a full replication from the new core.
I actually no longer use replication, but once had version 1.4.1 set up
this way.
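For what it's worth, the slaves notice the new index on their next poll
after the swap; you can also force an immediate fetch through the
replication handler. A sketch, with the slave host names as placeholders:

import urllib.request

# Ask each slave's replication handler to pull the master's index right away
# instead of waiting for its next poll interval.
for slave in ("slave1", "slave2"):
    urllib.request.urlopen(
        "http://" + slave + ":8983/solr/live/replication?command=fetchindex"
    )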
Thanks,
Shawn