Here is a really different approach. Make the two data centers one SolrCloud cluster and use a third data center (or EC2 region) for one additional ZooKeeper node. When you lose a DC, ZooKeeper still functions.
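For illustration only, a rough SolrJ sketch of what an indexing client pointed at that single cross-DC cluster might look like. The host names, port, collection name, and fields are made up, and it assumes one ZooKeeper node per data center so that losing any one DC still leaves a quorum (2 of 3):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CrossDcIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical hosts: one ZooKeeper node in each of the two main
            // data centers plus the extra node in the third site, so losing
            // any single DC still leaves a ZooKeeper quorum (2 of 3).
            String zkHosts = "zk-dc1.example.com:2181,"
                           + "zk-dc2.example.com:2181,"
                           + "zk-dc3.example.com:2181";

            CloudSolrServer solr = new CloudSolrServer(zkHosts);
            solr.setDefaultCollection("collection1");

            // Index a single test document against the cross-DC collection.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title_t", "cross-DC indexing smoke test");
            solr.add(doc);
            solr.commit();
            solr.shutdown();
        }
    }

The only interesting part is the zkHosts string: every Solr node and every client points at a single ensemble spanning the three sites, which is why ZooKeeper keeps its quorum when one DC goes away.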
There would be more traffic between datacenters.

wunder

On Aug 29, 2013, at 4:11 AM, Erick Erickson wrote:

> Yeah, reality gets in the way of simple solutions a lot.....
>
> And making it even more fun, you'd really want to only bring up one node for each shard in the broken DC and let that one be fully synched. Then bring up the replicas in a controlled fashion so you didn't saturate the local network with replications. And then you'd.....
>
> But as Shawn says, this is certainly functionality that would be waaay cool; there's just been no time to make it all work. The main folks who've been working in this area all have a mountain of higher-priority stuff to get done first....
>
> There's been talk of making SolrCloud "rack aware", which could extend into some kind of work in this area, but that's also on the "future" plate. As you're well aware, it's not a trivial problem!
>
> Hmmm, what you really want here is the ability to say to a recovering cluster "do your initial synch using nodes that the ZK ensemble located at XXX knows about, then switch to your very own ensemble". Something like a "remote recovery" option..... Which is _still_ kind of tricky; I sure hope you have identical sharding schemes.....
>
> FWIW,
> Erick
>
>
> On Wed, Aug 28, 2013 at 1:12 PM, Shawn Heisey <s...@elyograg.org> wrote:
>
>> On 8/28/2013 10:48 AM, Daniel Collins wrote:
>>
>>> What ideally I would like to do is at the point that I kick off recovery, divert the indexing feed for the "broken" DC into a transaction log on those machines, run the replication and swap the index in, then replay the transaction log to bring it all up to date. That process (conceptually) is the same as the org.apache.solr.cloud.RecoveryStrategy code.
>>
>> I don't think any such mechanism exists currently. It would be extremely awesome if it did. If there's not an existing Jira issue, I recommend that you file one. Being able to set up a multi-datacenter cloud with automatic recovery would be awesome. Even if it took a long time, having it be fully automated would be exceptionally useful.
>>
>>> Yes, if I could divert that feed at the application level, then I can do what you suggest, but it feels like more work to do that (and build an external transaction log), whereas the code seems to already be in Solr itself; I just need to hook it all up (famous last words!). Our indexing pipeline does a lot of pre-processing work (it's not just pulling data from a database), and since we are only talking about the time taken to do the replication (should be an hour or less), it feels like we ought to be able to store that in a Solr transaction log (i.e. the last point in the indexing pipeline).
>>
>> I think it would have to be a separate transaction log. One problem with really big regular tlogs is that when Solr gets restarted, the entire transaction log that's currently on the disk gets replayed. If it were big enough to recover the last several hours to a duplicate cloud, it would take forever to replay on Solr restart. If the regular tlog were kept small but a second log with the last 24 hours were available, it could replay updates when the second cloud came back up.
>>
>> I do import from a database, so the application-level tracking works really well for me.
>>
>> Thanks,
>> Shawn
>>

--
Walter Underwood
wun...@wunderwood.org
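Purely as an illustration of the "divert the feed at the application level" idea discussed in the thread above (this is not an existing Solr mechanism), a rough SolrJ sketch follows. The JournalingForwarder name, the tab-separated journal format, and the field names are invented for the example; it assumes the caller hands in a CloudSolrServer already pointed at the recovering cloud's ZooKeeper ensemble with its default collection set:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Hypothetical helper: forwards updates to the remote cloud and journals
    // them locally whenever that cloud is unreachable, for replay later.
    public class JournalingForwarder {
        private final CloudSolrServer remote;
        private final Path journal;

        public JournalingForwarder(CloudSolrServer remote, Path journal) {
            this.remote = remote;
            this.journal = journal;
        }

        // Try the remote cloud first; on failure, append the doc to the journal.
        public void index(SolrInputDocument doc) throws IOException {
            try {
                remote.add(doc);
            } catch (SolrServerException | IOException e) {
                try (BufferedWriter w = Files.newBufferedWriter(
                        journal, StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                    // One doc per line, id and title only -- a real journal
                    // would have to serialize every field (and deletes).
                    w.write(doc.getFieldValue("id") + "\t"
                            + doc.getFieldValue("title_t"));
                    w.newLine();
                }
            }
        }

        // Once the remote cloud has finished replicating, replay the journal.
        public void replay() throws IOException, SolrServerException {
            for (String line : Files.readAllLines(journal, StandardCharsets.UTF_8)) {
                String[] parts = line.split("\t", 2);
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", parts[0]);
                doc.addField("title_t", parts[1]);
                remote.add(doc);
            }
            remote.commit();
            Files.deleteIfExists(journal);
        }
    }

The shape of the idea is the same as the tlog diversion proposed in the thread: capture updates somewhere durable while the broken cloud replicates, then replay them in order once it is back, only done outside Solr at the application layer.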