Hi,

Our Solr cluster runs on VMs that can freeze for longer than the ZK tick
time (it's a non-critical CI/CD pipeline running on an overloaded ESX).
When this happens, the node's shards are registered as down. Then, when
the node comes back, recovery takes place and all shard replicas end up in
the active state. Everyone is happy.

However, we noticed that recovery doesn't take place if the collection was
reloaded and the server hasn't been restarted since. The shards end up in
the down state. Before I provide log messages: is this a known issue?

Steps to reproduce (assuming two nodes, solr1 and solr2):
1. Before starting, restart both solr1 and solr2. All shards are active.
2. Reload the collection.
3. Cause a disconnect by freezing the Java process. On solr2, run
   kill -SIGSTOP <solr server pid>, wait 2 minutes, then run
   kill -SIGCONT <solr server pid>.
4. The solr2 shard replicas stay *Down* forever. No recovery.
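
In case it helps anyone trying this, steps 2 and 3 can be scripted roughly as follows. This is only a sketch, not what we actually run: the function names are made up for illustration, and the base URL and collection name passed in are placeholders. The reload goes through the Collections API RELOAD action, and the freeze sends the same SIGSTOP/SIGCONT signals as the kill commands above, so it has to run on the solr2 host.

```python
import os
import signal
import time
import urllib.request
from urllib.parse import urlencode


def reload_url(base_url, collection):
    """Build the Collections API URL for step 2 (RELOAD)."""
    query = urlencode({"action": "RELOAD", "name": collection})
    return f"{base_url}/admin/collections?{query}"


def freeze_process(pid, seconds):
    """Step 3: suspend a process with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the wait is interrupted.
        os.kill(pid, signal.SIGCONT)


def run_repro(base_url, collection, solr_pid, freeze_seconds=120):
    """Run steps 2 and 3 in order. All arguments are supplied by the caller.

    base_url is assumed to look like http://<host>:<port>/solr.
    """
    with urllib.request.urlopen(reload_url(base_url, collection)) as resp:
        resp.read()
    freeze_process(solr_pid, freeze_seconds)
```

For example, run_repro("http://solr1:8983/solr", "mycollection", <solr server pid>) on solr2 (host, port and collection name are assumptions here). Skipping the reload call corresponds to omitting step 2.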

If we omit step #2, the cluster recovers as expected.
