Hi,

We have been testing an installation of SolrCloud under some failure
scenarios, and are seeing some issues we would like to fix before putting
this into production.

Our cluster consists of 6 servers running Solr 5.4.1, with the config stored
in our ZooKeeper cluster.
Each of our collections currently has a single shard, replicated across all
servers.

Scenario 1:
We start a full import from our database using the dataimport handler.
During the import we do a clean shutdown of the node running the import.
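
For reference, we trigger the import with a plain HTTP request to the
DataImportHandler, roughly like the sketch below (the host and the "content"
collection name are placeholders for our real setup):

# Trigger a DataImportHandler full import.
# clean=true clears the index before importing, commit=true commits at the end.
import urllib.request

url = ("http://localhost:8983/solr/content/dataimport"
       "?command=full-import&clean=true&commit=true")
print(urllib.request.urlopen(url).read().decode("utf-8"))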

When the node is started again, it comes up with a partial index. The index
is not resynced from the leader until we start and complete a new full
import.

Are we missing a setting that would make the import atomic? We would rather
roll back the whole update than serve a partial set of documents.
How can we make replicas stay in sync with the leader?
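
At the moment we only notice the drift by comparing per-replica document
counts with distrib=false, roughly like this (the host names and the
"content" collection are placeholders):

import urllib.request, json

# Ask each node for its local document count only (distrib=false bypasses
# distributed search, so every replica answers for itself).
def local_doc_count(host, collection):
    url = ("http://%s:8983/solr/%s/select?q=*:*&rows=0&distrib=false&wt=json"
           % (host, collection))
    data = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
    return data["response"]["numFound"]

for host in ["solr1", "solr2", "solr3", "solr4", "solr5", "solr6"]:
    print(host, local_doc_count(host, "content"))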

Scenario 2:
One of our servers had a disk error that left the Solr home directory
read-only.
On the cores where this node was a follower, it was correctly marked as down.
But on one core where this node was the leader, it stayed marked as healthy:
all updates failed, yet the node never realized it should step down as
leader.

In addition, leader elections stalled while this node was in the cluster.
When a second server was shut down, several cores stayed leaderless until
the node with the failed disk was shut down as well.
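
We spotted the leaderless shards by polling the Collections API CLUSTERSTATUS
action, roughly like this (the host is a placeholder):

import urllib.request, json

# Report shards that currently have no replica marked as leader.
url = ("http://localhost:8983/solr/admin/collections"
       "?action=CLUSTERSTATUS&wt=json")
status = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))

for coll_name, coll in status["cluster"]["collections"].items():
    for shard_name, shard in coll["shards"].items():
        if not any(r.get("leader") == "true"
                   for r in shard["replicas"].values()):
            print("No leader for %s %s" % (coll_name, shard_name))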

Is there a way to health-check nodes so that a disk failure makes the
affected node step down?

Scenario 3:
We replaced the faulty disk, which wiped the Solr home directory.
Starting Solr again did not resync the missing cores.

I do see some lines in our logs like:

2016-02-16 13:44:02.841 INFO (qtp1395089624-22) [ ] o.a.s.h.a.CoreAdminHandler It has been requested that we recover: core=content_shard1_replica1
2016-02-16 13:44:02.842 ERROR (Thread-15) [ ] o.a.s.h.a.CoreAdminHandler Could not find core to call recovery:content_shard1_replica1

Is there a way to force recovery of the cores a node should have based on
the collection replica settings?
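
As a workaround we are experimenting with dropping the stale replica entries
for the wiped node and re-adding replicas through the Collections API, so
they sync a fresh copy from the leader. A rough sketch of what we are trying
is below (the node name is a placeholder); we are not sure this is the
intended approach, hence the question above.

import urllib.request, json

BASE = "http://localhost:8983/solr/admin/collections"
WIPED_NODE = "solr5:8983_solr"  # node_name of the server that lost its disk

def call(params):
    url = BASE + "?wt=json&" + params
    return json.loads(urllib.request.urlopen(url).read().decode("utf-8"))

# Walk the cluster state and recreate every replica that should live on the
# wiped node: DELETEREPLICA removes the stale entry, ADDREPLICA creates a new
# replica on the same node, which then replicates from the leader.
status = call("action=CLUSTERSTATUS")
for coll, cdata in status["cluster"]["collections"].items():
    for shard, sdata in cdata["shards"].items():
        for replica, rdata in sdata["replicas"].items():
            if rdata["node_name"] == WIPED_NODE:
                call("action=DELETEREPLICA&collection=%s&shard=%s&replica=%s"
                     % (coll, shard, replica))
                call("action=ADDREPLICA&collection=%s&shard=%s&node=%s"
                     % (coll, shard, WIPED_NODE))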


Any tips on how to make this more robust would be appreciated.

Regards,
Håkon Hitland
