Hi - You're going to miss any fields that are indexed but not stored, since
SolrEntityProcessor can only pull back stored fields. What we do instead: we
stop any indexing process, kill the servlets in the down DC, copy the index
files over with scp, then remove the lock file and start it up again. It
always works, but it's a manual process at this point; it should be easy to
automate using some simple bash scripting.
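
For what it's worth, a rough sketch of what that automation might look like
is below. The hosts, data dir layout, lock file name and service commands
are placeholders for whatever your deployment actually uses; we haven't
scripted it yet ourselves.

#!/bin/bash
# Sketch only: hosts, data dir layout and service names are assumptions.
GOOD_HOST=solr-dc1.example.com        # DC that stayed online
DOWN_HOST=solr-dc2.example.com        # DC that went down
DATA_DIR=/var/solr/collection1/data   # core data dir holding index/

# 1. With indexing already stopped, take down the servlet container on
#    the recovering side so nothing holds the index open.
ssh "$DOWN_HOST" 'sudo service tomcat7 stop'

# 2. Replace the stale index with a copy from the surviving DC
#    (-3 routes the copy through this box, so the DCs don't need
#    ssh keys to each other).
ssh "$DOWN_HOST" "rm -rf $DATA_DIR/index"
scp -3 -r "$GOOD_HOST:$DATA_DIR/index" "$DOWN_HOST:$DATA_DIR/"

# 3. Drop the copied Lucene lock file and bring the servlets back up.
ssh "$DOWN_HOST" "rm -f $DATA_DIR/index/write.lock"
ssh "$DOWN_HOST" 'sudo service tomcat7 start'

The one thing to be careful of is that the good DC isn't mid-commit while
the copy runs, which is why we stop indexing first.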

-----Original message-----
> From:Timothy Potter <thelabd...@gmail.com>
> Sent: Wednesday 28th August 2013 15:41
> To: solr-user@lucene.apache.org
> Subject: Re: Data Centre recovery/replication, does this seem plausible?
> 
> I've been thinking about this one too and was curious about using the Solr
> Entity support in the DIH to import the lost docs from one DC to the other.
> In my mind, one configures the DIH to use the SolrEntityProcessor with a
> query against the DC that stayed online, most likely filtering on a
> timestamp to capture just the missed docs (see:
> http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor).
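>
> Something like this is what I have in mind (hand-wavy sketch: the hosts,
> core name and timestamp field are invented, and the SolrEntityProcessor
> entity itself would live in the DIH config on the recovered side):
>
>   # Sanity-check the timestamp query against the DC that stayed online.
>   curl 'http://solr-dc1.example.com:8983/solr/collection1/select?q=*:*&fq=timestamp_dt:[2013-08-28T00:00:00Z+TO+*]&rows=0&wt=json'
>
>   # Then kick off the import on the recovered DC; clean=false so the
>   # existing (old but valid) index isn't wiped first.
>   curl 'http://solr-dc2.example.com:8983/solr/collection1/dataimport?command=full-import&clean=false'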
> 
> Would that work? If so, any downsides? I've only used DIH /
> SolrEntityProcessor to populate a staging / dev environment from prod but
> have had good success with it.
> 
> Thanks.
> Tim
> 
> 
> > On Wed, Aug 28, 2013 at 6:59 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> 
> > The separate DC problem has been lurking for a while. But your
> > understanding is a little off. When a replica discovers that
> > it's "too far" out of date, it does an old-style replication. IOW, the
> > tlog doesn't contain the entire delta. Eventually, the old-style
> > replications catch up to "close enough" and _then_ the remaining
> > docs in the tlog are replayed. The target number of updates in the
> > tlog is 100 so it's a pretty small window that's actually replayed in
> > the normal case.
> >
> > None of which helps your problem. The simplest way (and on the
> > expectation that DC outages were pretty rare!) would be to have your
> > indexing process fire the missed updates at the DC after it came
> > back up.
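> >
> > (Concretely, "firing the missed updates" could be as simple as replaying
> > whatever your indexing pipeline logged while the DC was down straight at
> > its update handler; hosts, core and field names below are made up:
> >
> >   curl 'http://solr-dc2.example.com:8983/solr/collection1/update?commit=true' \
> >     -H 'Content-Type: application/json' \
> >     --data-binary '[{"id":"doc1","title_s":"indexed while DC 2 was down"}]'
> >
> > batching the docs and committing once at the end would obviously be
> > kinder to the recovering nodes.)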
> >
> > Copying from one DC to another is tricky. You'd have to be very,
> > very sure that you copied indexes to the right shard. Ditto for any
> > process that tried to have, say, a single node from the recovering
> > DC temporarily join the good DC, at least long enough to synch.
> >
> > Not a pretty problem; we don't really have any best practices for it yet
> > that I know of.
> >
> > FWIW,
> > Erick
> >
> >
> > On Wed, Aug 28, 2013 at 8:13 AM, Daniel Collins <danwcoll...@gmail.com>
> > wrote:
> >
> > > We have 2 separate data centers in our organisation, and in order to
> > > maintain the ZK quorum during any DC outage, we have 2 separate Solr
> > > clouds, one in each DC with separate ZK ensembles, but both are fed
> > > with the same indexing data.
> > >
> > > Now in the event of a DC outage, all our Solr instances go down, and when
> > > they come back up, we need some way to recover the "lost" data.
> > >
> > > Our thought was to replicate from the working DC, but is there a way to
> > > do that whilst still maintaining an "online" presence for indexing
> > > purposes?
> > >
> > > In essence, we want to do what happens within Solr cloud's recovery, so
> > > (as I understand cloud recovery) a node starts up (I'm assuming the
> > > worst case, where peer sync has failed), then buffers all updates into
> > > the transaction log, replicates from the leader, and replays the
> > > transaction log to get everything in sync.
> > >
> > > Is it conceivable to do the same by extending Solr, so that on the
> > > activation of some (user-triggered) handler, we initiate a "replicate
> > > from other DC", which puts all the leaders into buffering updates,
> > > replicates from some other set of servers, and then replays?
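> > >
> > > (For the "replicate from some other set of servers" part I was
> > > picturing something along the lines of the classic replication
> > > handler's fetchindex command pointed across DCs; hosts and core name
> > > here are placeholders:
> > >
> > >   curl 'http://solr-dc2.example.com:8983/solr/collection1/replication?command=fetchindex&masterUrl=http://solr-dc1.example.com:8983/solr/collection1/replication'
> > >
> > > wrapped in a handler that also flips the leaders into buffering
> > > updates first.)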
> > >
> > > Our goal is to try to minimize the downtime (beyond the initial
> > > outage), so we would ideally like to be able to start up indexing
> > > before this "replicate/clone" has finished; that's why I thought to
> > > enable buffering on the transaction log.  Searches shouldn't be sent
> > > here, but if they are, we have a valid (albeit old) index to serve
> > > them until the new one swaps in.
> > >
> > > Just curious how any other DC-aware setups handle this kind of
> > > scenario, or whether there are other concerns or issues with this type
> > > of approach.
> > >
> >
> 
