Well, if I'm reading this right, you had a node go out of circulation and then bounced nodes until that node became the leader. So of course it wouldn't have the documents (how could it?). Basically, you shot yourself in the foot.
The underlying question here is why the machine you were restarting took so long to come up that you got impatient and started killing nodes. There has been quite a bit of work done to make that process better, so what version of Solr are you using? 4.4 is being voted on right now, so you might want to consider upgrading. There was, for instance, a situation where it would take 3 minutes for machines to start up. How impatient were you?

Also, what are your hard commit parameters? All of the documents you're indexing will sit in the transaction log between hard commits, and when a node comes up the leader will replay everything in the tlog to the new node, which might be why it took so long for the new node to come back up. At the very least, the node you were bringing back online will need to do a full index replication (old style) to get caught up. There's a sketch of the relevant solrconfig.xml settings below the quoted message.

Best,
Erick

On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser <neil.pros...@gmail.com> wrote:
> While indexing some documents to a SolrCloud cluster (10 machines, 5 shards and 2 replicas, so one replica on each machine) one of the replicas stopped receiving documents, while the other replica of the shard continued to grow.
>
> That was overnight, so I was unable to track exactly what happened (I'm going off our Graphite graphs here). This morning, when I was able to look at the cluster, both replicas of that shard were marked as down (with one marked as leader). I attempted to restart the non-leader node but it took a long time to restart, so I killed it and restarted the old leader, which also took a long time. I killed that one (I'm impatient) and left the non-leader node to restart, not realising it was missing approximately 700k documents that the old leader had. Eventually it restarted and became leader. I restarted the old leader and it dropped its document count to match the previous non-leader.
>
> Is this expected behaviour when a replica with fewer documents is started before the other and elected leader? Should I have been paying more attention to the number of documents on each server before restarting nodes?
>
> I am still in the process of tuning the caches and warming for these servers, but we are putting some load through the cluster, so it is possible that the nodes are having to work quite hard when a new version of the core is made available. Is this likely to explain why I occasionally see nodes dropping out? Unfortunately, in restarting the nodes I lost the GC logs, so I can't tell whether that was the culprit. Is this the sort of situation where you raise the ZooKeeper timeout a bit? Currently the timeout for all nodes is 15 seconds.
>
> Are there any known issues which might explain what's happening? I'm just getting started with SolrCloud after using standard master/slave replication for an index which has got too big for one machine over the last few months.
>
> Also, is there any particular information that would be helpful if this should happen again?
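P.S. For what it's worth, here's a sketch of what I mean about hard commits. These settings live in the updateHandler section of solrconfig.xml; the numbers below are only illustrative assumptions, not recommendations for your cluster:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Hard commit: flushes the index to disk and rolls over the tlog.
           Keeping this interval reasonably short bounds how much the leader
           has to replay to a node that comes back up. -->
      <autoCommit>
        <maxTime>15000</maxTime>             <!-- illustrative: 15 seconds -->
        <openSearcher>false</openSearcher>   <!-- hard commit without opening a new searcher -->
      </autoCommit>
      <!-- Soft commit: controls when new documents become visible to searches -->
      <autoSoftCommit>
        <maxTime>60000</maxTime>             <!-- illustrative: 1 minute -->
      </autoSoftCommit>
    </updateHandler>

With openSearcher=false the hard commit is relatively cheap since it doesn't invalidate caches or trigger warming; visibility is handled by the soft commit instead. The point is just that if hard commits are very infrequent, the tlog grows and startup/replay takes correspondingly longer.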