I would not run ZooKeeper in a container. That seems like a very bad idea. Each ZooKeeper node has an identity; they are not interchangeable.
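To make the identity point concrete: in a static ensemble each server is pinned to a server.N line in zoo.cfg and a matching myid file on disk, so a replacement container that comes up under a different id or address is not a drop-in substitute. A rough sketch of what that looks like -- the client port matches your ZK_HOST, but the data dir and the peer/election ports are just the usual defaults, not your actual config:

    # zoo.cfg on every node (example values)
    dataDir=/var/lib/zookeeper
    clientPort=2182
    server.1=zk1.foo.com:2888:3888
    server.2=zk2.foo.com:2888:3888
    server.3=zk3.foo.com:2888:3888

    # and on zk1 only, an id file that matches its server.N entry
    echo 1 > /var/lib/zookeeper/myid

When your cloud provider replaces a dead node, the new one has to come back with the same server.N identity and be reachable at the address the other peers expect, or the ensemble treats it as a stranger.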
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Aug 31, 2018, at 11:14 AM, Jack Schlederer <jack.schlede...@directsupply.com> wrote:
>
> Thanks Erick. After some more testing, I'd like to correct the failure case we're seeing. It's not when 2 ZK nodes are killed that we have trouble recovering, but rather when all 3 ZK nodes that came up when the cluster was initially started get killed at some point. Even if it's one at a time, and we wait for a new one to spin up and join the cluster before killing the next one, we get into a bad state once none of the 3 nodes that were in the cluster initially are there anymore, even though they've been replaced by our cloud provider spinning up new ZKs. We assign DNS names to the ZooKeepers as they spin up, with a 10 second TTL, and those are what get set as the ZK_HOST environment variable on the Solr hosts (i.e., ZK_HOST=zk1.foo.com:2182,zk2.foo.com:2182,zk3.foo.com:2182). Our working hypothesis is that Solr's JVM caches the IP addresses for the ZK hosts' DNS names when it starts up and, for some reason, doesn't re-query DNS when it finds that an IP address is no longer reachable (i.e., when a ZooKeeper node dies and comes back up at a different IP). Our current trajectory has us finding a way to assign known static IPs to the ZK nodes upon startup and assigning those IPs to the ZK_HOST env var, so we can take DNS lookups out of the picture entirely.
>
> We can reproduce this in our cloud environment, as each ZK node has its own IP and DNS name, but it's difficult to reproduce locally because all the ZooKeeper containers have the same IP when running locally (127.0.0.1).
>
> Please let us know if you have insight into this issue.
>
> Thanks,
> Jack
>
> On Fri, Aug 31, 2018 at 10:40 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Jack:
>>
>> Is it possible to reproduce "manually"? By that I mean without the chaos bit, by doing the following:
>>
>> - Start 3 ZK nodes.
>> - Create a multi-node, multi-shard Solr collection.
>> - Sequentially stop and start the ZK nodes, waiting for the ZK quorum to recover between restarts.
>> - Check whether Solr fails to reconnect to the restarted ZK nodes and thinks it has lost quorum after the second node is restarted.
>>
>> bq. Kill 2, however, and we lose the quorum and we have collections/replicas that appear as "gone" on the Solr Admin UI's cloud graph display.
>>
>> It's odd that replicas appear as "gone"; it suggests that your ZK ensemble is possibly not correctly configured, although exactly how is a mystery. Solr pulls its picture of the topology of the network from ZK, establishes watches, and the like. For most operations, Solr doesn't even ask ZooKeeper for anything, since its picture of the cluster is stored locally. ZK's job is to inform the various Solr nodes when the topology changes, i.e. when _Solr_ nodes change state. For querying and indexing, there's no ZK involved at all. Even if _all_ ZooKeeper nodes disappear, Solr should still be able to talk to other Solr nodes and shouldn't show them as down just because it can't talk to ZK. Indeed, querying should be OK, although indexing will fail if quorum is lost.
>>
>> But you say you see the restarted ZK nodes rejoin the ZK ensemble, so the ZK config seems right. Is there any chance your chaos testing "somehow" restarts the ZK nodes with any changes to the configs? Shooting in the dark here.
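Coming back to the DNS-caching hypothesis above: before you go to static IPs, it may be worth checking the JVM's own address cache. This is a sketch, not something I've verified against your images -- the property names are standard JVM knobs, but the TTL value and the solr.in.sh placement are just examples:

    # JVM-wide security property (in the JRE's java.security file)
    networkaddress.cache.ttl=60

    # or the equivalent system property, e.g. appended in solr.in.sh
    SOLR_OPTS="$SOLR_OPTS -Dsun.net.inetaddr.ttl=60"

One caveat: the ZooKeeper client may also resolve the connect string into a host list when the session is created, so a shorter JVM TTL on its own might not be enough. Erick's manual repro above would be a good way to test whether re-resolution actually happens.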
>> For a replica to be "gone", the host node should _also_ have been removed from the "live_nodes" znode. Hmmm. I do wonder if what you're observing is a consequence of killing both ZK nodes and Solr nodes. I'm not saying this is what _should_ happen, just trying to understand what you're reporting.
>>
>> My theory here is that your chaos testing kills some Solr nodes and that fact is correctly propagated to the remaining Solr nodes. Then your ZK nodes are killed and somehow Solr doesn't reconnect to ZK appropriately, so its picture of the cluster has the node as permanently down. Then you restart the Solr node and that information isn't propagated to the other Solr nodes, since they didn't reconnect. If that were the case, I'd expect the admin UI to correctly show the state of the cluster when hit on a Solr node that has never been restarted.
>>
>> As you can tell, I'm using something of a scattergun approach here b/c this isn't what _should_ happen given what you describe. Theoretically, all the ZK nodes should be able to go away and come back, and Solr should reconnect...
>>
>> As an aside, if you are ever in the code you'll see that for a replica to be usable, it must have both its state set to "active" _and_ the corresponding node present in the live_nodes ephemeral znode.
>>
>> Is there any chance you could try the manual steps above (AWS isn't necessary here) and let us know what happens? And if we can get a reproducible set of steps, feel free to open a JIRA.
>>
>> On Thu, Aug 30, 2018 at 10:11 PM Jack Schlederer <jack.schlede...@directsupply.com> wrote:
>>>
>>> We run a 3-node ZK cluster, but I'm not concerned about 2 nodes failing at the same time. Our chaos process only kills approximately one node per hour, and our cloud service provider automatically spins up another ZK node when one goes down. All 3 ZK nodes are back up within 2 minutes, talking to each other and syncing data. It's just that Solr doesn't seem to recognize it. We'd have to restart Solr to get it to recognize the new ZooKeepers, which we can't do without taking downtime or losing data that's stored on non-persistent disk within the container.
>>>
>>> The ZK_HOST environment variable lists all 3 ZK nodes.
>>>
>>> We're running ZooKeeper version 3.4.13.
>>>
>>> Thanks,
>>> Jack
>>>
>>> On Thu, Aug 30, 2018 at 4:12 PM Walter Underwood <wun...@wunderwood.org> wrote:
>>>
>>>> How many ZooKeeper nodes are in your ensemble? You need five nodes to handle two failures.
>>>>
>>>> Are your Solr instances started with a zkHost that lists all five ZooKeeper nodes?
>>>>
>>>> What version of ZooKeeper?
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>>> On Aug 30, 2018, at 1:45 PM, Jack Schlederer <jack.schlede...@directsupply.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> My team is attempting to spin up a SolrCloud cluster with an external ZooKeeper ensemble. We're trying to engineer our solution to be HA and fault-tolerant such that we can lose either 1 Solr instance or 1 ZooKeeper and not take downtime. We use chaos engineering to randomly kill instances to test our fault tolerance.
>>>>> Killing Solr instances seems to be solved, as we use a high enough replication factor and Solr's built-in autoscaling to ensure that new Solr nodes added to the cluster get the replicas that were lost from the killed node. However, ZooKeeper seems to be a different story. We can kill 1 ZooKeeper instance and still maintain quorum, and everything is good. It comes back and starts participating in leader elections, etc. Kill 2, however, and we lose the quorum, and we have collections/replicas that appear as "gone" on the Solr Admin UI's cloud graph display, and we get Java errors in the log reporting that collections can't be read from ZK. This means we aren't servicing search requests. We found an open JIRA that reports this same issue, but its only affected version is 5.3.1. We are experiencing this problem in 7.3.1. Has there been any progress or potential workarounds on this issue since?
>>>>>
>>>>> Thanks,
>>>>> Jack
>>>>>
>>>>> Reference:
>>>>> https://issues.apache.org/jira/browse/SOLR-8868
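One more thing that might help narrow this down: look at what Solr's ZK state actually says while the cluster is in the bad state, and keep the quorum arithmetic in mind -- an ensemble of N servers needs a majority, so it only survives (N-1)/2 failures rounded down: 3 nodes tolerate 1, 5 tolerate 2, which is why I asked about five. A quick look with the ZooKeeper CLI might go something like this (the collection name is a placeholder, and add your chroot, e.g. /solr, if your ZK_HOST uses one):

    # from any ZooKeeper node
    bin/zkCli.sh -server zk1.foo.com:2182

    # then, inside the CLI shell:
    ls /live_nodes                             # Solr nodes currently registered as live
    get /collections/mycollection/state.json   # per-replica state; a replica must be "active" AND its node live

If live_nodes looks right from ZK's side but the Solr admin UI still shows replicas as gone, that points at Solr-to-ZK reconnection rather than at the ensemble itself.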