Hi,

We have been testing an installation of SolrCloud under some failure scenarios, and are seeing some issues we would like to fix before putting this into production.
Our cluster is 6 servers running Solr 5.4.1, with configuration stored in our ZooKeeper ensemble. Each of our collections currently has a single shard replicated across all servers.

Scenario 1: We start a full import from our database using the dataimport handler. During the import we do a clean shutdown of the node running the import. When the node is started again, it comes up with a partial index. The index is not resynced from the leader until we start and complete a new full import. Are we missing some setting that would make the update atomic? We would rather roll back the update than run with a partial set of documents. How can we make replicas stay in sync with the leader? (See sketch 1 in the P.S. below.)

Scenario 2: One of our servers had a disk error that made the Solr home directory turn read-only. On the cores where this node was a follower, it was correctly marked as down. But on one core where it was the leader, it stayed marked as healthy: all updates failed, yet the node never realized it should step down as leader. In addition, leader elections stalled while this node was in the cluster. When a second server was shut down, several cores stayed leaderless until the node with the failed disk was shut down as well. Is there a way to health-check nodes so that a disk failure makes the affected node step down? (See sketch 2 in the P.S. below.)

Scenario 3: We replaced the faulty disk, which wiped the Solr home directory. Starting Solr again did not resync the missing cores. I do see lines like these in our logs:

2016-02-16 13:44:02.841 INFO (qtp1395089624-22) [ ] o.a.s.h.a.CoreAdminHandler It has been requested that we recover: core=content_shard1_replica1
2016-02-16 13:44:02.842 ERROR (Thread-15) [ ] o.a.s.h.a.CoreAdminHandler Could not find core to call recovery:content_shard1_replica1

Is there a way to force recovery of the cores a node should have, based on the collection's replica settings? (See sketch 3 in the P.S. below.)

Any tips on how to make this more robust would be appreciated.

Regards,
Håkon Hitland
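
P.S. To make the questions more concrete, here are a few rough sketches (SolrJ 5.4; host names, paths and the node name are placeholders) of external checks we could run. They are only sketches, not something we assume Solr provides out of the box.

Sketch 1: comparing per-replica document counts after a restart by asking each core for its local count with distrib=false (the core name is taken from the log above, the URLs are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ReplicaCountCheck {
    // Placeholder core URLs; one entry per replica of the same shard.
    private static final String[] CORE_URLS = {
        "http://solr1.example.com:8983/solr/content_shard1_replica1",
        "http://solr2.example.com:8983/solr/content_shard1_replica2"
    };

    public static void main(String[] args) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        q.set("distrib", "false"); // ask each core for its local count only

        for (String url : CORE_URLS) {
            try (HttpSolrClient core = new HttpSolrClient(url)) {
                long numFound = core.query(q).getResults().getNumFound();
                System.out.println(url + " -> " + numFound + " docs");
            }
        }
    }
}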
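
Sketch 2: for the read-only disk case, a minimal external watchdog that writes a probe file under the Solr home directory; what to do on failure (alerting, stopping the node so it gives up leadership) is left to whatever runs the probe:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SolrHomeWriteProbe {
    public static void main(String[] args) {
        // Placeholder Solr home; pass the real path as the first argument.
        Path solrHome = Paths.get(args.length > 0 ? args[0] : "/var/solr/data");
        Path probe = solrHome.resolve(".disk-write-probe");
        try {
            Files.write(probe, "ok".getBytes(), StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING);
            Files.deleteIfExists(probe);
            System.out.println("Solr home is writable: " + solrHome);
        } catch (Exception e) {
            // Exit non-zero so a monitoring system or wrapper script can stop
            // this node instead of letting it keep leadership and fail updates.
            System.err.println("Solr home is not writable: " + e);
            System.exit(1);
        }
    }
}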
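
Sketch 3: listing the cores the cluster state expects on a given node, read through ZooKeeper; cores that are missing on disk could then be re-created, for example with the Collections API ADDREPLICA action:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class ReplicasExpectedOnNode {
    public static void main(String[] args) throws Exception {
        // Placeholders: our ZooKeeper ensemble and the node that lost its disk.
        String zkHost = "zk1.example.com:2181,zk2.example.com:2181/solr";
        String nodeName = "solr3.example.com:8983_solr";

        try (CloudSolrClient cloud = new CloudSolrClient(zkHost)) {
            cloud.connect();
            ClusterState state = cloud.getZkStateReader().getClusterState();
            for (String collection : state.getCollections()) {
                for (Slice slice : state.getCollection(collection).getSlices()) {
                    for (Replica replica : slice.getReplicas()) {
                        if (nodeName.equals(replica.getNodeName())) {
                            System.out.println(collection + "/" + slice.getName()
                                    + " expects core " + replica.getStr(ZkStateReader.CORE_NAME_PROP)
                                    + " (state=" + replica.getStr(ZkStateReader.STATE_PROP) + ")");
                        }
                    }
                }
            }
        }
    }
}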