While indexing documents to a SolrCloud cluster (10 machines, 5 shards with 2 replicas each, so one replica per machine), one of the replicas stopped receiving documents while the other replica of that shard continued to grow.
That was overnight, so I was unable to track exactly what happened (I'm going off our Graphite graphs here). This morning, when I was able to look at the cluster, both replicas of that shard were marked as down (with one marked as leader). I attempted to restart the non-leader node, but it took a long time to restart, so I killed it and restarted the old leader, which also took a long time. I killed that one too (I'm impatient) and left the non-leader node to restart, not realising it was missing approximately 700k documents that the old leader had. Eventually it restarted and became leader. When I then restarted the old leader, it dropped its document count to match the previous non-leader.

Is this expected behaviour when a replica with fewer documents is started before the other and elected leader? Should I have been paying more attention to the number of documents on each node before restarting them?

I am still in the process of tuning the caches and warming for these servers, but we are putting some load through the cluster, so it is possible that the nodes are having to work quite hard when a new version of the core is made available. Is this likely to explain why I occasionally see nodes dropping out? Unfortunately, in restarting the nodes I lost the GC logs, so I can't tell whether garbage collection was the culprit. Is this the sort of situation where you would raise the ZooKeeper timeout a bit? Currently the timeout for all nodes is 15 seconds (our current setting and the GC logging flags I'm planning to add are pasted below).

Are there any known issues which might explain what's happening? I'm just getting started with SolrCloud, after using standard master/slave replication for an index which has grown too big for one machine over the last few months. Also, is there any particular information that would be helpful to gather if this happens again?
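For reference, here's roughly how the ZooKeeper timeout is set on each node at the moment. I'm paraphrasing our solr.xml (the legacy 4.x-style format) from memory, so the exact attributes may not be verbatim:

    <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8983}"
           zkClientTimeout="15000">
      ...
    </cores>

And these are the GC logging options I'm thinking of adding, assuming a shell startup script along these lines, so the logs survive the next restart (standard HotSpot flags, with a timestamped file name so each run writes to its own file):

    GC_LOG=/var/log/solr/gc-$(date +%Y%m%d-%H%M%S).log
    JAVA_OPTS="$JAVA_OPTS -Xloggc:$GC_LOG -XX:+PrintGCDetails -XX:+PrintGCDateStamps"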