The separate DC problem has been lurking for a while. But your
understanding is a little off. When a replica discovers that
it's "too far" out of date, it does an old-style replication. IOW, the
tlog doesn't contain the entire delta. Eventually, the old-style
replication catches up to "close enough" and _then_ the remaining
docs in the tlog are replayed. The target number of updates kept in
the tlog is 100, so it's a pretty small window that's actually
replayed in the normal case.
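
For reference, that window maps to the update log section of
solrconfig.xml. I believe newer releases expose the target as
numRecordsToKeep (shown below with the 100 default); older releases
effectively hard-code it, so check what your version supports:

<!-- The tlog that gets replayed once old-style replication is
     "close enough". numRecordsToKeep is the target discussed above. -->
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numRecordsToKeep">100</int>
</updateLog>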

None of which helps your problem. The simplest way (and on the
expectation that DC outages are pretty rare!) would be to have your
indexing process fire the missed updates at the DC after it comes
back up.
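
Roughly something like the (untested) SolrJ sketch below: the indexer
dual-writes to both clouds and hangs on to whatever the unreachable DC
missed. The class name, the in-memory queue and the error handling are
all made up for illustration; a real version would persist the buffer
and deal with retries and commits properly.

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical dual-DC indexer: send every update to both clouds and
// buffer whatever the unreachable DC missed so it can be fired at it
// once it comes back.
public class DualDcIndexer {
    private final CloudSolrServer dc1;
    private final CloudSolrServer dc2;
    private final Queue<SolrInputDocument> missedByDc2 =
        new ConcurrentLinkedQueue<SolrInputDocument>();

    // Older SolrJ constructors declare MalformedURLException, hence the throws.
    public DualDcIndexer(String zkHostDc1, String zkHostDc2, String collection)
            throws MalformedURLException {
        dc1 = new CloudSolrServer(zkHostDc1);
        dc1.setDefaultCollection(collection);
        dc2 = new CloudSolrServer(zkHostDc2);
        dc2.setDefaultCollection(collection);
    }

    public void index(SolrInputDocument doc)
            throws SolrServerException, IOException {
        dc1.add(doc); // assume DC1 is healthy; real code buffers here too
        try {
            dc2.add(doc);
        } catch (Exception e) {
            // DC2 down or unreachable: remember the update for later replay.
            missedByDc2.add(doc);
        }
    }

    // Call this once DC2 is back up and healthy again.
    public void replayMissedUpdates()
            throws SolrServerException, IOException {
        SolrInputDocument doc;
        while ((doc = missedByDc2.poll()) != null) {
            dc2.add(doc);
        }
        dc2.commit();
    }
}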

Copying from one DC to another is tricky. You'd have to be very,
very sure that you copied indexes to the right shard. Ditto for any
process that tried to have, say, a single node from the recovering
DC temporarily join the good DC, at least long enough to synch.

Not a pretty problem; we don't really have any best practices for it
yet that I know of.

FWIW,
Erick


On Wed, Aug 28, 2013 at 8:13 AM, Daniel Collins <danwcoll...@gmail.com> wrote:

> We have 2 separate data centers in our organisation, and in order to
> maintain the ZK quorum during any DC outage, we have 2 separate Solr
> clouds, one in each DC with separate ZK ensembles but both are fed with the
> same indexing data.
>
> Now in the event of a DC outage, all our Solr instances go down, and when
> they come back up, we need some way to recover the "lost" data.
>
> Our thought was to replicate from the working DC, but is there a way to do
> that whilst still maintaining an "online" presence for indexing purposes?
>
> In essence, we want to do what happens within Solr cloud's recovery, so (as
> I understand cloud recovery) a node starts up, (I'm assuming worst case and
> peer sync has failed) then buffers all updates into the transaction log,
> replicates from the leader, and replays the transaction log to get
> everything in sync.
>
> Is it conceivable to do the same by extending Solr, so on the activation of
> some handler (user triggered), we initiate a "replicate from other DC",
> which puts all the leaders into buffering updates, replicates from some
> other set of servers and then replays?
>
> Our goal is to try to minimize the downtime (beyond the initial outage), so
> we would ideally like to be able to start up indexing before this
> "replicate/clone" has finished, that's why I thought to enable buffering on
> the transaction log.  Searches shouldn't be sent here, but if they do we
> have a valid (albeit old) index to serve those until the new one swaps in.
>
> Just curious how any other DC-aware setups handle this kind of scenario?
>  Or other concerns, issues with this type of approach.
>
