Hi,

We have been testing an installation of SolrCloud under some failure scenarios, and are seeing some issues we would like to fix before putting this into production.
Our cluster is 6 servers running Solr 5.4.1, with configuration stored in our ZooKeeper ensemble. Each of our collections currently has a single shard replicated across all servers.

Scenario 1: We start a full import from our database using the dataimport handler. During the import we do a clean shutdown of the node running the import. When the node is started again, it comes up with a partial index. The index is not resynced from the leader until we start and complete a new full import. Are we missing some setting that would make the update atomic? We would rather roll back the update than run with a partial set of documents. How can we make replicas stay in sync with the leader? (See sketch 1 in the P.S. below.)

Scenario 2: One of our servers had a disk error that made the Solr home directory turn read-only. On the cores where this node was a follower, it was correctly marked as down. But on one core where it was the leader, it stayed marked as healthy: all updates failed, yet the node never realized it should step down as leader. In addition, leader elections stalled while this node was in the cluster. When a second server was shut down, several cores stayed leaderless until the node with the failed disk was shut down as well. Is there a way to health-check nodes so that a disk failure makes the affected node step down? (See sketch 2 in the P.S. below.)

Scenario 3: We replaced the faulty disk, which wiped the Solr home directory. Starting Solr again did not resync the missing cores. I do see lines like these in our logs:

2016-02-16 13:44:02.841 INFO (qtp1395089624-22) [ ] o.a.s.h.a.CoreAdminHandler It has been requested that we recover: core=content_shard1_replica1
2016-02-16 13:44:02.842 ERROR (Thread-15) [ ] o.a.s.h.a.CoreAdminHandler Could not find core to call recovery:content_shard1_replica1

Is there a way to force recovery of the cores a node should have, based on the collection's replica settings? (See sketch 3 in the P.S. below.)

Any tips on how to make this more robust would be appreciated.

Regards,
Håkon Hitland
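
P.S. To make the questions more concrete, here are a few rough sketches (SolrJ 5.4; host names, paths and the node name are placeholders) of external checks we could run. They are only sketches, not something we assume Solr provides out of the box.

Sketch 1: comparing per-replica document counts after a restart by asking each core for its local count with distrib=false (the core name is taken from the log above, the URLs are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ReplicaCountCheck {
    // Placeholder core URLs; one entry per replica of the same shard.
    private static final String[] CORE_URLS = {
        "http://solr1.example.com:8983/solr/content_shard1_replica1",
        "http://solr2.example.com:8983/solr/content_shard1_replica2"
    };

    public static void main(String[] args) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        q.set("distrib", "false"); // ask each core for its local count only

        for (String url : CORE_URLS) {
            try (HttpSolrClient core = new HttpSolrClient(url)) {
                long numFound = core.query(q).getResults().getNumFound();
                System.out.println(url + " -> " + numFound + " docs");
            }
        }
    }
}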
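
Sketch 2: for the read-only disk case, a minimal external watchdog that writes a probe file under the Solr home directory; what to do on failure (alerting, stopping the node so it gives up leadership) is left to whatever runs the probe:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SolrHomeWriteProbe {
    public static void main(String[] args) {
        // Placeholder Solr home; pass the real path as the first argument.
        Path solrHome = Paths.get(args.length > 0 ? args[0] : "/var/solr/data");
        Path probe = solrHome.resolve(".disk-write-probe");
        try {
            Files.write(probe, "ok".getBytes(), StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING);
            Files.deleteIfExists(probe);
            System.out.println("Solr home is writable: " + solrHome);
        } catch (Exception e) {
            // Exit non-zero so a monitoring system or wrapper script can stop
            // this node instead of letting it keep leadership and fail updates.
            System.err.println("Solr home is not writable: " + e);
            System.exit(1);
        }
    }
}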
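
Sketch 3: listing the cores the cluster state expects on a given node, read through ZooKeeper; cores that are missing on disk could then be re-created, for example with the Collections API ADDREPLICA action:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class ReplicasExpectedOnNode {
    public static void main(String[] args) throws Exception {
        // Placeholders: our ZooKeeper ensemble and the node that lost its disk.
        String zkHost = "zk1.example.com:2181,zk2.example.com:2181/solr";
        String nodeName = "solr3.example.com:8983_solr";

        try (CloudSolrClient cloud = new CloudSolrClient(zkHost)) {
            cloud.connect();
            ClusterState state = cloud.getZkStateReader().getClusterState();
            for (String collection : state.getCollections()) {
                for (Slice slice : state.getCollection(collection).getSlices()) {
                    for (Replica replica : slice.getReplicas()) {
                        if (nodeName.equals(replica.getNodeName())) {
                            System.out.println(collection + "/" + slice.getName()
                                    + " expects core " + replica.getStr(ZkStateReader.CORE_NAME_PROP)
                                    + " (state=" + replica.getStr(ZkStateReader.STATE_PROP) + ")");
                        }
                    }
                }
            }
        }
    }
}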