Here is a really different approach. Make the two data centers one SolrCloud cluster and use a third data center (or EC2 region) for one additional ZooKeeper node. When you lose a DC, ZooKeeper still functions.
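For illustration only, a rough SolrJ sketch of what an indexing client pointed at that single cross-DC cluster might look like. The host names, port, collection name, and fields are made up, and it assumes one ZooKeeper node per data center so that losing any one DC still leaves a quorum (2 of 3):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CrossDcIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical hosts: one ZooKeeper node in each of the two main
            // data centers plus the extra node in the third site, so losing
            // any single DC still leaves a ZooKeeper quorum (2 of 3).
            String zkHosts = "zk-dc1.example.com:2181,"
                           + "zk-dc2.example.com:2181,"
                           + "zk-dc3.example.com:2181";

            CloudSolrServer solr = new CloudSolrServer(zkHosts);
            solr.setDefaultCollection("collection1");

            // Index a single test document against the cross-DC collection.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title_t", "cross-DC indexing smoke test");
            solr.add(doc);
            solr.commit();
            solr.shutdown();
        }
    }

The only interesting part is the zkHosts string: every Solr node and every client points at a single ensemble spanning the three sites, which is why ZooKeeper keeps its quorum when one DC goes away.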
There would be more traffic between datacenters.

wunder

On Aug 29, 2013, at 4:11 AM, Erick Erickson wrote:

> Yeah, reality gets in the way of simple solutions a lot.....
>
> And making it even more fun, you'd really want to only bring up one node for each shard in the broken DC and let that one be fully synched. Then bring up the replicas in a controlled fashion so you didn't saturate the local network with replications. And then you'd.....
>
> But as Shawn says, this is certainly functionality that would be waaay cool; there's just been no time to make it all work. The main folks who've been working in this area all have a mountain of higher-priority stuff to get done first....
>
> There's been talk of making SolrCloud "rack aware", which could extend into some kind of work in this area, but that's also on the "future" plate. As you're well aware, it's not a trivial problem!
>
> Hmmm, what you really want here is the ability to say to a recovering cluster "do your initial synch using nodes that the ZK ensemble located at XXX knows about, then switch to your very own ensemble". Something like a "remote recovery" option..... Which is _still_ kind of tricky; I sure hope you have identical sharding schemes.....
>
> FWIW,
> Erick
>
>
> On Wed, Aug 28, 2013 at 1:12 PM, Shawn Heisey <s...@elyograg.org> wrote:
>
>> On 8/28/2013 10:48 AM, Daniel Collins wrote:
>>
>>> What ideally I would like to do is at the point that I kick off recovery, divert the indexing feed for the "broken" DC into a transaction log on those machines, run the replication and swap the index in, then replay the transaction log to bring it all up to date. That process (conceptually) is the same as the org.apache.solr.cloud.RecoveryStrategy code.
>>
>> I don't think any such mechanism exists currently. It would be extremely awesome if it did. If there's not an existing Jira issue, I recommend that you file one. Being able to set up a multi-datacenter cloud with automatic recovery would be awesome. Even if it took a long time, having it be fully automated would be exceptionally useful.
>>
>>> Yes, if I could divert that feed at the application level, then I can do what you suggest, but it feels like more work to do that (and build an external transaction log), whereas the code seems to already be in Solr itself; I just need to hook it all up (famous last words!). Our indexing pipeline does a lot of pre-processing work (it's not just pulling data from a database), and since we are only talking about the time taken to do the replication (should be an hour or less), it feels like we ought to be able to store that in a Solr transaction log (i.e. the last point in the indexing pipeline).
>>
>> I think it would have to be a separate transaction log. One problem with really big regular tlogs is that when Solr gets restarted, the entire transaction log that's currently on the disk gets replayed. If it were big enough to recover the last several hours to a duplicate cloud, it would take forever to replay on Solr restart. If the regular tlog were kept small but a second log with the last 24 hours were available, it could replay updates when the second cloud came back up.
>>
>> I do import from a database, so the application-level tracking works really well for me.
>>
>> Thanks,
>> Shawn
>>

--
Walter Underwood
wun...@wunderwood.org
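Purely as an illustration of the "divert the feed at the application level" idea discussed in the thread above (this is not an existing Solr mechanism), a rough SolrJ sketch follows. The JournalingForwarder name, the tab-separated journal format, and the field names are invented for the example; it assumes the caller hands in a CloudSolrServer already pointed at the recovering cloud's ZooKeeper ensemble with its default collection set:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Hypothetical helper: forwards updates to the remote cloud and journals
    // them locally whenever that cloud is unreachable, for replay later.
    public class JournalingForwarder {
        private final CloudSolrServer remote;
        private final Path journal;

        public JournalingForwarder(CloudSolrServer remote, Path journal) {
            this.remote = remote;
            this.journal = journal;
        }

        // Try the remote cloud first; on failure, append the doc to the journal.
        public void index(SolrInputDocument doc) throws IOException {
            try {
                remote.add(doc);
            } catch (SolrServerException | IOException e) {
                try (BufferedWriter w = Files.newBufferedWriter(
                        journal, StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                    // One doc per line, id and title only -- a real journal
                    // would have to serialize every field (and deletes).
                    w.write(doc.getFieldValue("id") + "\t"
                            + doc.getFieldValue("title_t"));
                    w.newLine();
                }
            }
        }

        // Once the remote cloud has finished replicating, replay the journal.
        public void replay() throws IOException, SolrServerException {
            for (String line : Files.readAllLines(journal, StandardCharsets.UTF_8)) {
                String[] parts = line.split("\t", 2);
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", parts[0]);
                doc.addField("title_t", parts[1]);
                remote.add(doc);
            }
            remote.commit();
            Files.deleteIfExists(journal);
        }
    }

The shape of the idea is the same as the tlog diversion proposed in the thread: capture updates somewhere durable while the broken cloud replicates, then replay them in order once it is back, only done outside Solr at the application layer.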