While indexing documents into a SolrCloud cluster (10 machines, 5 shards with
2 replicas each, so one replica on each machine), one of the replicas stopped
receiving documents while the other replica of that shard continued to grow.

That was overnight, so I was unable to track exactly what happened (I'm
going off our Graphite graphs here). This morning, when I was able to look
at the cluster, both replicas of that shard were marked as down (with one
still marked as leader). I attempted to restart the non-leader node, but it
was taking a long time, so I killed it and restarted the old leader instead,
which also took a long time to come back. I killed that one too (I'm
impatient) and left the non-leader node to restart, not realising it was
missing approximately 700k documents that the old leader had. Eventually it
came up and became leader. I then restarted the old leader, and it dropped
its document count to match the previous non-leader.

Is this expected behaviour when the replica with fewer documents is started
before the other and elected leader? Should I have been paying more
attention to the number of documents on each node before restarting anything?
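
To be concrete, the sort of check I have in mind is roughly the sketch
below: hitting each replica core directly with distrib=false and rows=0 and
comparing numFound before deciding which node to restart first. The host
and core names are just placeholders for my cluster, not real values; only
the distrib=false / rows=0 query itself is standard Solr.

# Sketch: compare per-replica document counts before restarting anything.
# Host and core names below are placeholders, not my real cluster values.
import json
from urllib.request import urlopen

replicas = {
    "solr01": "http://solr01:8983/solr/collection1_shard3_replica1",
    "solr02": "http://solr02:8983/solr/collection1_shard3_replica2",
}

for name, core_url in replicas.items():
    # distrib=false keeps the query on that core only (no fan-out to the
    # rest of the collection); rows=0 returns just the numFound count.
    url = core_url + "/select?q=*:*&rows=0&distrib=false&wt=json"
    with urlopen(url) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    print(name, body["response"]["numFound"])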

I am still in the process of tuning the caches and warming for these
servers, but we are putting some load through the cluster, so it is possible
that the nodes are having to work quite hard when a new version of the core
is made available. Is this likely to explain why I occasionally see nodes
dropping out? Unfortunately, in restarting the nodes I lost the GC logs, so
I can't check whether long GC pauses were the culprit. Is this the sort of
situation where you raise the ZooKeeper timeout a bit? Currently the
timeout for all nodes is 15 seconds.
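
For reference, this is roughly what I'm planning to try before it happens
again: turning GC logging on with the usual JVM flags and bumping the
ZooKeeper timeout through the zkClientTimeout system property, assuming the
stock solr.xml (with zkClientTimeout="${zkClientTimeout:15000}") is still in
place on these nodes.

java -verbose:gc -Xloggc:logs/gc.log \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -DzkClientTimeout=30000 \
     -jar start.jar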

Are there any known issues which might explain what's happening? I'm just
getting started with SolrCloud after using standard master/slave
replication for an index that has, over the last few months, grown too big
for one machine.

Also, is there any particular information that would be helpful to capture
if this happens again?
