If you can satisfy this statement then it seems possible (it's the same restriction as with "atomic updates"): the SolrEntityProcessor can only copy fields that are stored in the source index.
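If all your fields are stored, the data-config is pretty small. An untested sketch of what it might look like is below (the source URL, collection name, timestamp field and cut-over time are all made up; point the url at the DC that stayed online):

    <dataConfig>
      <document>
        <!-- pulls docs back from the surviving DC; only stored fields come across -->
        <entity name="recoverFromOtherDC"
                processor="SolrEntityProcessor"
                url="http://surviving-dc-host:8983/solr/collection1"
                query="timestamp:[2013-08-28T00:00:00Z TO *]"
                rows="500"/>
      </document>
    </dataConfig>

Wire that up behind a /dataimport handler in solrconfig.xml as usual and kick it off with command=full-import, and remember to pass clean=false so the full-import doesn't wipe the existing index before it starts copying.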
On Wed, Aug 28, 2013 at 9:41 AM, Timothy Potter <thelabd...@gmail.com> wrote:

> I've been thinking about this one too and was curious about using the Solr
> Entity support in the DIH to do the import from one DC to another (for the
> lost docs). In my mind, one configures the DIH to use the
> SolrEntityProcessor with a query to capture the docs in the DC that stayed
> online, most likely using a timestamp in the query (see:
> http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor).
>
> Would that work? If so, any downsides? I've only used DIH /
> SolrEntityProcessor to populate a staging / dev environment from prod but
> have had good success with it.
>
> Thanks.
> Tim
>
>
> On Wed, Aug 28, 2013 at 6:59 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > The separate DC problem has been lurking for a while. But your
> > understanding is a little off. When a replica discovers that
> > it's "too far" out of date, it does an old-style replication. IOW, the
> > tlog doesn't contain the entire delta. Eventually, the old-style
> > replications catch up to "close enough" and _then_ the remaining
> > docs in the tlog are replayed. The target number of updates in the
> > tlog is 100, so it's a pretty small window that's actually replayed in
> > the normal case.
> >
> > None of which helps your problem. The simplest way (and on the
> > expectation that DC outages were pretty rare!) would be to have your
> > indexing process fire the missed updates at the DC after it came
> > back up.
> >
> > Copying from one DC to another is tricky. You'd have to be very,
> > very sure that you copied indexes to the right shard. Ditto for any
> > process that tried to have, say, a single node from the recovering
> > DC temporarily join the good DC, at least long enough to synch.
> >
> > Not a pretty problem, we don't really have any best practices yet
> > that I know of.
> >
> > FWIW,
> > Erick
> >
> >
> > On Wed, Aug 28, 2013 at 8:13 AM, Daniel Collins <danwcoll...@gmail.com> wrote:
> >
> > > We have 2 separate data centers in our organisation, and in order to
> > > maintain the ZK quorum during any DC outage, we have 2 separate Solr
> > > clouds, one in each DC with separate ZK ensembles, but both are fed
> > > with the same indexing data.
> > >
> > > Now in the event of a DC outage, all our Solr instances go down, and
> > > when they come back up, we need some way to recover the "lost" data.
> > >
> > > Our thought was to replicate from the working DC, but is there a way
> > > to do that whilst still maintaining an "online" presence for indexing
> > > purposes?
> > >
> > > In essence, we want to do what happens within Solr cloud's recovery,
> > > so (as I understand cloud recovery) a node starts up, (I'm assuming
> > > worst case and peer sync has failed) then buffers all updates into the
> > > transaction log, replicates from the leader, and replays the
> > > transaction log to get everything in sync.
> > >
> > > Is it conceivable to do the same by extending Solr, so that on the
> > > activation of some handler (user triggered), we initiate a "replicate
> > > from other DC", which puts all the leaders into buffering updates,
> > > replicates from some other set of servers and then replays?
> > >
> > > Our goal is to try to minimize the downtime (beyond the initial
> > > outage), so we would ideally like to be able to start up indexing
> > > before this "replicate/clone" has finished; that's why I thought to
> > > enable buffering on the transaction log. Searches shouldn't be sent
> > > here, but if they are, we have a valid (albeit old) index to serve
> > > those until the new one swaps in.
> > >
> > > Just curious how any other DC-aware setups handle this kind of
> > > scenario? Or other concerns, issues with this type of approach.