Hi all,

My team is attempting to spin up a SolrCloud cluster with an external
ZooKeeper ensemble. We're trying to engineer our solution to be HA and
fault-tolerant such that we can lose either 1 Solr instance or 1 ZooKeeper instance
and not take downtime. We use chaos engineering to randomly kill instances
to test our fault-tolerance. Killing Solr instances seems to be solved, as
we use a high enough replication factor and Solr's built-in autoscaling to
ensure that new Solr nodes added to the cluster get the replicas that were
lost from the killed node. However, ZooKeeper seems to be a different
story. We can kill 1 ZooKeeper instance and still maintain quorum, and
everything is fine. The killed instance comes back and resumes
participating in leader elections, etc.
Kill 2, however, and we lose quorum: collections/replicas appear as "gone"
in the Solr Admin UI's cloud graph display, and the logs show Java errors
reporting that collections can't be read from ZK. As a result, we stop
servicing search requests. We found an open JIRA
that reports this same issue, but the only affected version it lists is
5.3.1. We are experiencing this problem in 7.3.1. Has there been any
progress on this issue since, or are there any known workarounds?
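
For context, the quorum arithmetic behind the behavior described above can be
sketched roughly as follows. ZooKeeper needs a strict majority of the
ensemble to be up. The ensemble size of 3 used here is an assumption for
illustration only (the actual size is not stated above), and the function
names are hypothetical helpers, not part of any ZooKeeper or Solr API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble of the given size."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many instances can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, failed: int) -> bool:
    """True if the surviving instances still form a strict majority."""
    return ensemble_size - failed >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Assumed ensemble size of 3; 5 is shown only for comparison.
    for n in (3, 5):
        print(f"ensemble={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
    print("3 nodes, 1 killed, quorum held:", has_quorum(3, 1))
    print("3 nodes, 2 killed, quorum held:", has_quorum(3, 2))
```

Under that assumption, losing 1 of 3 instances keeps the ensemble available,
while losing 2 does not, which matches what we see on the ZooKeeper side.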

Thanks,
Jack

Reference:
https://issues.apache.org/jira/browse/SOLR-8868
