On 5/24/2017 4:14 PM, Jan Høydahl wrote:
> Sure, ZK by design does not support a two-node/two-location setup. But still,
> users may want or need to deploy that, and my question was whether there are
> smart ways to make such a setup as painless as possible in case of failure.
>
> Take the example of DC1: 3xZK and DC2: 2xZK again. And then DC1 goes BOOM.
> Without active intervention, DC2 would be read-only.
> What if the Ops personnel in DC2 could then, with a single script/command, 
> instruct DC2 to take over the “master” role:
> - Add a 3rd DC2 ZK to the two existing, reconfigure and let them sync up.
> - Rolling restart of Solr nodes with new ZK_HOST string
> Of course, they would also need to make sure that DC1 does not boot up 
> again before a compatible change has been made there too.

When ZK 3.5 comes out and SolrCloud is updated to use it, I think that
it might be possible to remove the dc1 servers from the ensemble and add
another server in dc2 to re-form a new quorum, without restarting
anything.  It could be quite some time before a stable 3.5 is available,
based on past release history.  The ZooKeeper project doesn't release
anywhere near as often as Lucene/Solr does.
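
For reference, the 3.5 dynamic reconfiguration would be a single command from
zkCli.sh connected to one of the surviving dc2 servers.  A rough sketch,
assuming dc1 holds server IDs 1-3 and the new dc2 server gets ID 6 (the IDs
and hostname are made up, and I haven't tested this against a 3.5 build):

  reconfig -remove 1,2,3 -add server.6=zk3-dc2.example.com:2888:3888;2181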

With the current ZK version, I think your procedure would work, but I
definitely wouldn't call it painless.  Indexing would be unavailable
when dc1 goes down, and everything could be down while the restarts are
happening.
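
If you do go that route, the Solr side of the change is just the ZK_HOST
string in solr.in.sh on each node, followed by a restart of that node.
Something like this, with made-up dc2 hostnames and a /solr chroot:

  ZK_HOST="zk1-dc2.example.com:2181,zk2-dc2.example.com:2181,zk3-dc2.example.com:2181/solr"

  bin/solr restart -p 8983

Restart one node at a time so the cloud stays at least partially available.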

Whether ZK 3.5 is there or not, there is potential for unknown behavior when
dc1 comes back online, unless you can have dc1 personnel shut the
servers down, or block communication between your servers in dc1 and dc2.
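
Blocking that communication could be a simple firewall rule on the dc2
servers that drops traffic from dc1's address range on the ZK ports.  A
sketch with iptables, assuming default ports and 10.1.0.0/16 as an example
dc1 subnet:

  iptables -A INPUT -s 10.1.0.0/16 -p tcp -m multiport --dports 2181,2888,3888 -j DROP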

Overall, having one or two ZK servers in each main DC and a tiebreaker
ZK on a low-cost server in a third DC seems like a better option. 
There's no intervention required when a DC goes down, or when it comes
back up.
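
For that layout, the server list in zoo.cfg is the same on all five machines,
something like this (hostnames are examples):

  server.1=zk1-dc1.example.com:2888:3888
  server.2=zk2-dc1.example.com:2888:3888
  server.3=zk1-dc2.example.com:2888:3888
  server.4=zk2-dc2.example.com:2888:3888
  server.5=zk-tiebreak-dc3.example.com:2888:3888

Losing either main DC still leaves three of the five voters, so quorum
survives without anyone touching anything.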

Thanks,
Shawn
