Hi all,

My team is attempting to spin up a SolrCloud cluster with an external ZooKeeper ensemble. We're trying to engineer our solution to be HA and fault-tolerant, such that we can lose either one Solr instance or one ZooKeeper instance without taking downtime. We use chaos engineering to randomly kill instances and test our fault tolerance.

Killing Solr instances seems to be solved: we use a high enough replication factor, along with Solr's built-in autoscaling, to ensure that new Solr nodes added to the cluster get the replicas that were lost from the killed node.

ZooKeeper, however, seems to be a different story. We can kill one ZooKeeper instance and still maintain quorum, and everything is good. The instance comes back and starts participating in leader elections, etc. Kill two, however, and we lose the quorum: collections/replicas appear as "gone" on the Solr Admin UI's cloud graph display, and we get Java errors in the log reporting that collections can't be read from ZK. This means we aren't servicing search requests.

We found an open JIRA that reports this same issue, but its only listed affected version is 5.3.1. We are experiencing this problem in 7.3.1. Has there been any progress on this issue since, or are there any potential workarounds?
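
For anyone checking our quorum arithmetic: as I understand it, ZooKeeper needs a strict majority of the ensemble to be up in order to keep a quorum. Below is a minimal sketch of that calculation. The function names and the ensemble sizes of 3 and 5 are illustrative assumptions, not our actual configuration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of a ZooKeeper ensemble of the given size."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


# Illustrative ensemble sizes only (assumption, not our real deployment).
for n in (3, 5):
    print(f"ensemble={n} quorum={quorum_size(n)} tolerated_failures={tolerated_failures(n)}")
```

By this arithmetic, a 3-node ensemble survives one failure but not two, while a 5-node ensemble survives two.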
Thanks,
Jack

Reference: https://issues.apache.org/jira/browse/SOLR-8868