We run a 3-node ZK cluster, but I'm not concerned about 2 nodes failing at the same time. Our chaos process kills only about one node per hour, and our cloud service provider automatically spins up a replacement ZK node when one goes down. All 3 ZK nodes are back up within 2 minutes, talking to each other and syncing data. The problem is that Solr doesn't seem to recognize the replacement nodes. We'd have to restart Solr to get it to recognize the new ZooKeeper nodes, which we can't do without taking downtime or losing data that's stored on non-persistent disk within the container.
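For reference, the majority-quorum arithmetic behind the 3-node versus 5-node question can be sketched as follows. This is only a minimal illustration; the function names `quorum_size` and `tolerated_failures` are made up for this example and are not part of any ZooKeeper or Solr API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble of the given size."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of nodes that can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare the 3-node ensemble described here with a 5-node ensemble.
    for n in (3, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

With 3 nodes a majority is 2, so one simultaneous failure is survivable; with 5 nodes a majority is 3, so two are.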
The ZK_HOST environment variable lists all 3 ZK nodes. We're running ZooKeeper version 3.4.13.

Thanks,
Jack

On Thu, Aug 30, 2018 at 4:12 PM Walter Underwood <wun...@wunderwood.org> wrote:

> How many Zookeeper nodes in your ensemble? You need five nodes to
> handle two failures.
>
> Are your Solr instances started with a zkHost that lists all five
> Zookeeper nodes?
>
> What version of Zookeeper?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Aug 30, 2018, at 1:45 PM, Jack Schlederer <jack.schlede...@directsupply.com> wrote:
> >
> > Hi all,
> >
> > My team is attempting to spin up a SolrCloud cluster with an external
> > ZooKeeper ensemble. We're trying to engineer our solution to be HA and
> > fault-tolerant such that we can lose either 1 Solr instance or 1
> > ZooKeeper and not take downtime. We use chaos engineering to randomly
> > kill instances to test our fault-tolerance. Killing Solr instances
> > seems to be solved, as we use a high enough replication factor and
> > Solr's built in autoscaling to ensure that new Solr nodes added to the
> > cluster get the replicas that were lost from the killed node. However,
> > ZooKeeper seems to be a different story. We can kill 1 ZooKeeper
> > instance and still maintain, and everything is good. It comes back and
> > starts participating in leader elections, etc. Kill 2, however, and we
> > lose the quorum and we have collections/replicas that appear as "gone"
> > on the Solr Admin UI's cloud graph display, and we get Java errors in
> > the log reporting that collections can't be read from ZK. This means
> > we aren't servicing search requests. We found an open JIRA that
> > reports this same issue, but its only affected version is 5.3.1. We
> > are experiencing this problem in 7.3.1. Has there been any progress or
> > potential workarounds on this issue since?
> >
> > Thanks,
> > Jack
> >
> > Reference:
> > https://issues.apache.org/jira/browse/SOLR-8868