Hey Annette,

Are you using Solr 4.0 final, or a version off the 4x or 5x branch?
Do you have the logs from when the replica tried to catch up to the leader? Stopping and starting the node is actually a fine thing to do. Perhaps you can try it again and capture the logs.

If a node is not listed as live but is still in the clusterstate, that is fine. It shouldn't be consulted. To remove it, you either have to unload it with the CoreAdmin API or manually delete its registered state under the node states node that the Overseer looks at. (There is a rough example of the CoreAdmin call at the bottom of this mail, below your quoted message.)

Also, it would be useful to see the logs of the new node coming up... there should be info about what happens when it tries to replicate. It almost sounds like replication is just not working for your setup at all and that you have to tweak some configuration. You shouldn't see these nodes as active then, though - so we should get to the bottom of this.

- Mark

On Dec 4, 2012, at 4:37 AM, Annette Newton <annette.new...@servicetick.com> wrote:

> Hi all,
>
> I have a quite weird issue with Solr cloud. I have a 4 shard, 2 replica
> setup. Yesterday one of the nodes lost communication with the cloud setup,
> which resulted in it trying to run replication. This failed, which has left
> me with a shard (Shard 4) that has 2,833,940 documents on the leader and
> 409,837 on the follower – obviously a big discrepancy, and this leads to
> queries returning differing results depending on which of these nodes the
> data comes from. There is no indication of a problem on the admin site other
> than the big discrepancy in the number of documents. They are all marked as
> active etc.
>
> So I thought that I would force replication to happen again by stopping and
> starting Solr (probably the wrong thing to do), but this resulted in no
> change. So I turned off that node and replaced it with a new one. In
> ZooKeeper, live nodes doesn't list that machine, but it is still being shown
> as active in the clusterstate.json; I have attached images showing this.
> This means the new node hasn't replaced the old node but is now a replica on
> Shard 1! Also that node doesn't appear to have replicated Shard 1's data
> anyway; it didn't get marked as replicating or anything.
>
> How do I clear the ZooKeeper state without taking down the entire Solr cloud
> setup? How do I force a node to replicate from the others in the shard?
>
> Thanks in advance.
>
> Annette Newton
>
>
> <LiveNodes.zip>
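
The CoreAdmin unload I mentioned above is just an HTTP call against the node that is hosting the stale core. A rough sketch in Python (standard library only) is below - the host, port and core name are placeholders, so check the Core Admin page of the old node for the exact core name before running anything like this:

    # Sketch: unload a stale core via the CoreAdmin UNLOAD action so it is
    # removed from the cluster state. Host/port/core name are placeholders.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    ADMIN_URL = "http://old-node:8983/solr/admin/cores"  # node hosting the stale core
    params = urlencode({"action": "UNLOAD", "core": "collection1_shard4_replica2"})

    with urlopen(ADMIN_URL + "?" + params) as resp:
        print(resp.status, resp.read().decode())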