Someone really needs to test this with EC2 availability zones. I haven't had the time, but I know other clustered NoSQL solutions like HBase and Cassandra can deal with it.

Michael Della Bitta
Applications Developer
o: +1 646 532 3062 | c: +1 917 477 7906

appinions inc.
"The Science of Influence Marketing"
18 East 41st Street
New York, NY 10017
t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, Aug 29, 2013 at 12:20 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> Here is a really different approach.
>
> Make the two data centers one Solr Cloud cluster and use a third data
> center (or EC2 region) for one additional Zookeeper node. When you lose
> a DC, Zookeeper still functions.
>
> There would be more traffic between datacenters.
>
> wunder
>
> On Aug 29, 2013, at 4:11 AM, Erick Erickson wrote:
>
> > Yeah, reality gets in the way of simple solutions a lot.....
> >
> > And making it even more fun, you'd really want to only bring up one
> > node for each shard in the broken DC and let that one be fully synched.
> > Then bring up the replicas in a controlled fashion so you didn't
> > saturate the local network with replications. And then you'd.....
> >
> > But as Shawn says, this is certainly functionality that would be waaay
> > cool; there's just been no time to make it all work. The main folks
> > who've been working in this area all have a mountain of higher-priority
> > stuff to get done first....
> >
> > There's been talk of making SolrCloud "rack aware", which could extend
> > into some kind of work in this area, but that's also on the "future"
> > plate. As you're well aware, it's not a trivial problem!
> >
> > Hmmm, what you really want here is the ability to say to a recovering
> > cluster "do your initial synch using nodes that the ZK ensemble located
> > at XXX knows about, then switch to your very own ensemble". Something
> > like a "remote recovery" option..... Which is _still_ kind of tricky;
> > I sure hope you have identical sharding schemes.....
> >
> > FWIW,
> > Erick
> >
> > On Wed, Aug 28, 2013 at 1:12 PM, Shawn Heisey <s...@elyograg.org> wrote:
> >
> >> On 8/28/2013 10:48 AM, Daniel Collins wrote:
> >>
> >>> What ideally I would like to do is, at the point that I kick off
> >>> recovery, divert the indexing feed for the "broken" DC into a
> >>> transaction log on those machines, run the replication and swap the
> >>> index in, then replay the transaction log to bring it all up to date.
> >>> That process (conceptually) is the same as the
> >>> org.apache.solr.cloud.RecoveryStrategy code.
> >>
> >> I don't think any such mechanism exists currently. It would be
> >> extremely awesome if it did. If there's not an existing Jira issue,
> >> I recommend that you file one. Being able to set up a multi-datacenter
> >> cloud with automatic recovery would be awesome. Even if it took a long
> >> time, having it be fully automated would be exceptionally useful.
> >>
> >>> Yes, if I could divert that feed at the application level, then I can
> >>> do what you suggest, but it feels like more work to do that (and build
> >>> an external transaction log) whereas the code seems to already be in
> >>> Solr itself, I just need to hook it all up (famous last words!)
> >>> Our indexing pipeline does a lot of pre-processing work (it's not
> >>> just pulling data from a database), and since we are only talking
> >>> about the time taken to do the replication (should be an hour or
> >>> less), it feels like we ought to be able to store that in a Solr
> >>> transaction log (i.e. the last point in the indexing pipeline).
> >>
> >> I think it would have to be a separate transaction log. One problem
> >> with really big regular tlogs is that when Solr gets restarted, the
> >> entire transaction log that's currently on the disk gets replayed. If
> >> it were big enough to recover the last several hours to a duplicate
> >> cloud, it would take forever to replay on Solr restart. If the regular
> >> tlog were kept small but a second log with the last 24 hours were
> >> available, it could replay updates when the second cloud came back up.
> >>
> >> I do import from a database, so the application-level tracking works
> >> really well for me.
> >>
> >> Thanks,
> >> Shawn
>
> --
> Walter Underwood
> wun...@wunderwood.org
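
Regarding Walter's three-data-center suggestion above: the point is that the ZooKeeper ensemble spans three locations, so a 2-of-3 quorum survives the loss of any single DC and the cluster keeps electing leaders and accepting updates. Here is a minimal sketch of an indexing client pointed at such an ensemble, assuming SolrJ 4.x (CloudSolrServer) and hypothetical host, collection, and field names:

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;

    public class CrossDcIndexer {
        public static void main(String[] args) throws IOException, SolrServerException {
            // One ZooKeeper node per location: DC1, DC2, and a third
            // "tie-breaker" site. Losing any one DC still leaves a 2-of-3
            // quorum, so cluster state and leader election keep working.
            String zkHost = "zk1.dc1.example.com:2181,"
                          + "zk2.dc2.example.com:2181,"
                          + "zk3.dc3.example.com:2181";

            CloudSolrServer solr = new CloudSolrServer(zkHost);
            solr.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title_t", "cross-DC indexing test");

            solr.add(doc);   // routed to the shard leader via the ZK cluster state
            solr.commit();
            solr.shutdown();
        }
    }

As Walter notes, the price is more inter-DC traffic: every update crosses the WAN to reach replicas in the other data center, and ZooKeeper traffic does too, so latency between the sites matters.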
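
Daniel's divert-and-replay idea is conceptually what a single SolrCloud node already does when it recovers (buffer incoming updates, pull the index from a peer, then apply the buffered updates); as Shawn says, nothing in Solr exposes that across two separate clouds today. Until it does, the same effect can be approximated in the indexing application: keep a bounded window of recent updates alongside the live feed and replay it into the recovered cloud once its bulk replication has finished. A rough, hypothetical sketch (in-memory only, SolrJ 4.x, invented class and names), not existing Solr functionality:

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Application-side "replay log": every update sent to the healthy cloud
     *  is also kept for a bounded window so it can be replayed into the other
     *  cloud after that cloud has caught up via bulk replication. */
    public class DualFeedIndexer {
        private final CloudSolrServer primary;
        private final Deque<SolrInputDocument> replayLog = new ArrayDeque<SolrInputDocument>();
        private final int maxBuffered;

        public DualFeedIndexer(String primaryZkHost, int maxBuffered) {
            this.primary = new CloudSolrServer(primaryZkHost);
            this.primary.setDefaultCollection("collection1");
            this.maxBuffered = maxBuffered;
        }

        /** Index into the healthy cloud and remember the doc for later replay. */
        public synchronized void index(SolrInputDocument doc)
                throws IOException, SolrServerException {
            primary.add(doc);
            replayLog.addLast(doc);
            while (replayLog.size() > maxBuffered) {
                replayLog.removeFirst();   // keep only the recent tail, like a capped tlog
            }
        }

        /** Once the broken cloud has a full copy of the index, push the buffered tail. */
        public synchronized void replayInto(CloudSolrServer recovered)
                throws IOException, SolrServerException {
            // Full-document adds overwrite by uniqueKey, so replaying docs that
            // are already in the replicated snapshot is harmless.
            for (SolrInputDocument doc : replayLog) {
                recovered.add(doc);
            }
            recovered.commit();
        }
    }

A real version would need what Shawn describes: a durable on-disk log (a second, much larger "tlog") sized to cover the whole outage-plus-replication window, kept separate from Solr's own tlog so normal restarts stay fast. Replaying full documents keyed by uniqueKey keeps overlap with the replicated snapshot harmless, but this sketch doesn't handle atomic updates or deletes that fall outside the buffered window.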