We have 2 collections with 1 shard each, replicated over 5 servers in the
cluster. We see a lot of flapping (down or recovering) on one of the
collections. When this happens, the other collection hosted on the same
machine is still marked as active, and it takes a fairly long time
(~30 minutes) for the affected collection to come back online, if it comes
back at all. I find that it's usually more reliable to completely shut down
Solr on the affected machine and bring it back up with its core disabled.
We then re-enable the core when it's marked as active.

A few questions:

1) What is the health check in SolrCloud? Put another way, what is failing
that marks one collection as down but the other on the same machine as up?

2) Why does recovery take forever when a node goes down, even if it's only
down for 30 seconds? Our index is only 7-8 GB and we are running on SSDs.

3) What can be done to diagnose and fix this problem?
