Thanks for checking, Shawn.

So a rolling ZK restart is bad, and ZK nodes with different configs are bad. I guess this could still work if

* All ZK config changes are done by stopping ALL ZK nodes
* All config changes are done in a controlled, manual way, so DC1 doesn't come up by surprise with the old config

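To make those two conditions a bit more concrete, here is a minimal sketch of a pre-start check that someone could run by hand while all ZK nodes are stopped. It only compares the `server.N` lines of each node's zoo.cfg and reports the majority quorum size. The function names and the way the config text is collected per node are my own assumptions, and observers or dynamic reconfiguration are not handled.

```python
import re

# Matches zoo.cfg lines such as "server.1=zoo1:2888:3888".
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*(\S+)\s*$")


def parse_servers(cfg_text):
    """Return {server_id: address} from the server.N lines of a zoo.cfg."""
    servers = {}
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            servers[int(match.group(1))] = match.group(2)
    return servers


def quorum_size(voters):
    """Majority quorum: more than half of the voting members."""
    return voters // 2 + 1


def check_ensemble(cfg_by_node):
    """Compare the server lists of all nodes; return (ok, message).

    cfg_by_node maps a node name (hypothetical, chosen by the operator)
    to the text of that node's zoo.cfg.
    """
    parsed = {node: parse_servers(text) for node, text in cfg_by_node.items()}
    distinct = {tuple(sorted(s.items())) for s in parsed.values()}
    if len(distinct) != 1:
        return False, "server lists differ between nodes; do not start"
    voters = len(next(iter(parsed.values())))
    return True, f"{voters} voters, quorum needs {quorum_size(voters)}"
```

With 3+3 voters, `quorum_size(6)` is 4, so neither DC can form a quorum on its own with three nodes. That is why DC2 has to be reconfigured at all, and why a DC1 node waking up with the old server list is the case the check is meant to catch.
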
PS: I was not proposing an *automatic* triggering of a reconfiguration script, but rather a script that someone runs manually, to make sure one does not mess up the reconfiguration.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 2 Jun 2017 at 14:57, Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 5/29/2017 8:57 AM, Jan Høydahl wrote:
>> And if you start all three in DC1, you have 3+3 voting, what would
>> then happen? Any chance of state corruption?
>>
>> I believe that my solution isolates manual change to two ZK nodes in
>> DC2, while yours requires config change to 1 in DC2 and manual
>> start/stop of 1 in DC1.
>
> I took the scenario to the zookeeper user list. Here's the thread:
>
> http://zookeeper-user.578899.n2.nabble.com/Yet-another-quot-two-datacenter-quot-discussion-td7583106.html
>
> I'm not completely clear on what they're saying, but here's what I think
> it means: Dealing with a loss of dc1 by reconfiguring ZK servers in DC2
> might work, or it might crash and burn once connectivity to DC1 is restored.
>
>> Well, that’s not up to me to decide, it’s the customer environment
>> that sets the constraints, they currently have 2 independent geo
>> locations. And Solr is just a dependency of some other app they need
>> to install, so doubt that they are very happy to start adding racks or
>> independent power/network for this alone. Of course, if they already
>> have such redundancy within one of the DCs, placing a 3rd ZK there is
>> an ideal solution with probably good enough HA. If not, I’m looking
>> for the 2nd best low-friction approach with software-only.
>
> Even if all goes well with scripted reconfiguration of DC2, I don't
> think I'd want to try and automate it, because of the chance for a brief
> outage to trigger it.
> Without automation, if the failure happened at
> just the wrong moment, it could be a while before anyone notices, and it
> might be hours after it gets noticed before relevant personnel are in a
> position to run the reconfiguration script on DC2, during which you'd
> have a read-only SolrCloud.
>
> Frequently search is such a critical part of a web application that
> if it doesn't work, there IS no web application. That certainly
> describes the systems that use the Solr installations that I manage.
> For that kind of application, damage to reputation caused by a couple of
> hours where the website doesn't get any updates might be MUCH more
> expensive than the monthly cost for a virtual private server from a
> hosting company.
>
> Thanks,
> Shawn
>