I've heard of systems tanking like this on Windows during OS updates. Because of that, I run all my updates attended even though I'm on Linux. My nodes run as VMs: I shut down Solr gracefully, snapshot the VM as a backup, then update and run. If things go screwy I can always roll back.

To me it sounds like a lack of resources or a kink in your networking, assuming your setup is correct. Watch out for homemade network cables; I've seen soft crimp connectors put on solid wire, which can wreck a switch port for good.

Do you have a separate transaction log device on each ZooKeeper? I made this mistake in the beginning and had similar problems under load.
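If you haven't split them yet, ZooKeeper has a dataLogDir setting that is separate from dataDir for exactly this purpose. A minimal zoo.cfg sketch, with example paths only (adjust to your layout):

    tickTime=2000
    clientPort=2181
    # snapshots can stay on ordinary storage
    dataDir=/var/lib/zookeeper
    # put the transaction log on its own dedicated disk
    dataLogDir=/zookeeper-txnlog

ZooKeeper fsyncs the transaction log on every write, so sharing that disk with snapshots (or anything else busy) is a classic source of latency under load.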
GW

On 5 June 2017 at 22:32, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: This means that technically the replica nodes should not fall behind
> and do not have to go into recovery mode
>
> Well, true if nothing weird happens. By "weird" I mean anything that
> interferes with the leader getting anything other than a success code
> back from a follower it sends documents to.
>
> bq: Is this the only scenario in which a node can go into recovery status?
>
> No, there are others. One for-instance: the leader sends a doc to the
> follower and the request times out (huge GC pauses, the doc takes too
> long to index for whatever reason, etc.). The leader then sends a
> message to the follower to go directly into the recovery state, since
> the leader has no way of knowing whether the follower successfully
> wrote the document to its transaction log. You'll see messages about
> "leader initiated recovery" in the follower's Solr log in this case.
>
> Two bits of pedantry:
>
> bq: Down by the other replicas
>
> Almost. We're talking indexing here, and IIUC only the leader can send
> another node into recovery, as all updates go through the leader.
>
> If I'm going to be nit-picky, ZooKeeper can _also_ cause a node to be
> marked as down if its periodic ping of the node fails to return.
> Actually, I think this is done through another Solr node that ZK
> notifies....
>
> bq: It goes into a recovery mode and tries to recover all the
> documents from the leader of shard1.
>
> Also nit-picky. But if the follower isn't "too far" behind, it can be
> brought back into sync via "peer sync", where it gets the missed
> docs sent to it from the tlog of a healthy replica. "Too far" is 100
> docs by default, but can be set in solrconfig.xml if necessary. If
> that limit is exceeded, then indeed the entire index is copied from
> the leader.
>
> Best,
> Erick
>
> On Mon, Jun 5, 2017 at 5:18 PM, suresh pendap <sureshfors...@gmail.com> wrote:
> > Hi,
> >
> > Why and in what scenarios do Solr nodes go into recovery status?
> >
> > Given that Solr is a CP system, the writes for a document
> > are acknowledged only after they are propagated to and acknowledged by
> > all the replicas of the shard.
> >
> > This means that technically the replica nodes should not fall behind
> > and do not have to go into recovery mode.
> >
> > Is my above understanding correct?
> >
> > Can the below scenario happen?
> >
> > 1. Assume that we have 3 replicas for shard shard1 with the names
> > shard1_replica1, shard1_replica2 and shard1_replica3.
> >
> > 2. Due to some reason, a network issue or something else, shard1_replica2
> > is not reachable by the other replicas and is marked as Down by the
> > other replicas (shard1_replica1 and shard1_replica3 in this case).
> >
> > 3. The network issue is resolved and shard1_replica2 is reachable
> > again. It goes into recovery mode and tries to recover all the documents
> > from the leader of shard1.
> >
> > Is this the only scenario in which a node can go into recovery status?
> >
> > In other words, does the node have to go into a Down status before getting
> > back into a recovery status?
> >
> > Regards
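P.S. The "too far behind" threshold Erick mentions is the numRecordsToKeep setting on the updateLog in solrconfig.xml. A minimal sketch, assuming the stock DirectUpdateHandler2 setup; 500 is just an example value, not a recommendation:

    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
        <!-- peer sync can replay up to this many docs from the tlog
             before falling back to a full index copy (default 100) -->
        <int name="numRecordsToKeep">500</int>
      </updateLog>
    </updateHandler>

The trade-off is tlog size on disk versus how often a briefly-lagging replica has to do a full replication from the leader.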