We have 2 collections, each with 1 shard replicated across 5 servers in the cluster. We see a lot of flapping (down or recovering) on one of the collections. When this happens, the other collection hosted on the same machine is still marked as active, and it takes a fairly long time (~30 minutes) for the affected collection to come back online, if it comes back at all. I find that it's usually more reliable to completely shut down Solr on the affected machine and bring it back up with its core disabled. We then re-enable the core once it's marked as active.
A few questions:
1) What is the health check in SolrCloud? Put another way, what is failing that marks one collection as down but the other on the same machine as up?
2) Why does recovery take so long when a node goes down, even if it's only down for 30 seconds? Our index is only 7-8 GB and we are running on SSDs.
3) What can be done to diagnose and fix this problem?
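For anyone trying to watch the same symptom, here is a rough sketch of how the per-replica state could be polled. It assumes a Solr version that exposes the Collections API CLUSTERSTATUS action, and that the JSON response is laid out as cluster -> collections -> shards -> replicas with a "state" and "node_name" on each replica. The base URL, collection names, and node names below are made up for illustration, and the helper names are mine, not anything from Solr.

```python
import json
import urllib.request


def fetch_cluster_status(base_url):
    """Fetch CLUSTERSTATUS as a dict.

    base_url is an assumption, e.g. "http://localhost:8983/solr".
    """
    url = base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def non_active_replicas(status):
    """Return (collection, shard, replica, node, state) for every replica
    whose reported state is anything other than "active"."""
    rows = []
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state", "unknown")
                if state != "active":
                    rows.append(
                        (coll_name, shard_name, replica_name,
                         replica.get("node_name"), state)
                    )
    return rows


# Hypothetical response fragment: two collections sharing host2, where only
# collection_a's replica on that host is recovering.
SAMPLE = {
    "cluster": {
        "collections": {
            "collection_a": {"shards": {"shard1": {"replicas": {
                "core_node1": {"state": "active", "node_name": "host1:8983_solr"},
                "core_node2": {"state": "recovering", "node_name": "host2:8983_solr"},
            }}}},
            "collection_b": {"shards": {"shard1": {"replicas": {
                "core_node1": {"state": "active", "node_name": "host1:8983_solr"},
                "core_node2": {"state": "active", "node_name": "host2:8983_solr"},
            }}}},
        }
    }
}

for row in non_active_replicas(SAMPLE):
    print(row)
```

Against a live cluster the same function would be called as non_active_replicas(fetch_cluster_status(base_url)). Logging the result every few seconds alongside the Solr logs on the affected machine might help line up each state change with whatever triggered it.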