On 2/8/2016 1:09 PM, Kelly, Frank wrote:
> We are running a small SolrCloud instance on AWS
>
> Solr : Version 5.3.1
> ZooKeeper: Version 3.4.6
>
> 3 x ZooKeeper nodes (with higher limits and timeouts due to being on AWS)
> 3 x Solr Nodes (8 GB of memory each – 2 collections with 3 shards for
> each collection)
>
> Let’s call the ZooKeeper nodes A, B and C.
> One of our ZooKeeper nodes (B) failed a health check and was replaced
> due to autoscaling, but during this time of failover
> our SolrCloud cluster became unavailable. All new connections to Solr
> were unable to connect, complaining about connectivity issues,
> and preexisting connections also had errors.
>
<snip>
> I thought because we had configured SolrCloud to point at all three ZK
> nodes that the failure of one ZK node would be OK (since we still had
> a quorum).
>  Did I misunderstand something about SolrCloud and its relationship
> with ZK?

That's supposed to be how ZooKeeper and SolrCloud work, if everything is
configured properly and has full network connectivity.

What is your zkHost string for Solr?  Is the zkHost value the same on
all three SolrCloud nodes?  It should be identical on all of them, and
every server should be able to directly reach every other server on all
relevant ports.
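
As a rough example (the hostnames here are placeholders, not your real
ones), the ZK_HOST value in solr.in.sh on every Solr node would look
something like this, listing all three ensemble members:

  ZK_HOST="zkA.example.com:2181,zkB.example.com:2181,zkC.example.com:2181"

or, if you use a chroot:

  ZK_HOST="zkA.example.com:2181,zkB.example.com:2181,zkC.example.com:2181/solr"

If any node lists only one or two of the ZK servers, losing the wrong
server can make that node appear to lose ZK even though the ensemble
itself still has quorum.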

> The weird thing now is that when the new ZooKeeper node (D) started up
> – after a few minutes we could connect to SolrCloud again even though
> we were still only pointing to A,B and C (not D).
> Any thoughts on why this also happened?

This sounds odd.

The exceptions that you outlined are from *client* code
(CloudSolrClient), not the Solr servers.  CloudSolrClient instances
should normally be constructed using the same zkHost string that your
Solr servers use, listing all of the zookeeper servers.  Is this how
they are set up?
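
As a rough sketch (the ZK hostnames and collection name below are
placeholders), a CloudSolrClient in SolrJ 5.x would be created with the
full ensemble string, not a single ZK host:

  import org.apache.solr.client.solrj.impl.CloudSolrClient;

  // List every ZooKeeper node, the same string the Solr servers use.
  String zkHost =
      "zkA.example.com:2181,zkB.example.com:2181,zkC.example.com:2181";
  CloudSolrClient client = new CloudSolrClient(zkHost);
  client.setDefaultCollection("collection1");

If the client had only host B in its zkHost string, then losing B would
explain the client-side exceptions even though the ensemble kept quorum.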

I am unsure how all this might be affected by the internal/external
addressing that AWS uses.

Thanks,
Shawn