GW, did you mean a separate transaction log on Solr or on Zookeeper?

-suresh
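(Editor's note: on the ZooKeeper side, the transaction log location is controlled by dataLogDir in zoo.cfg; if it is not set, the log shares dataDir with the snapshots. Below is a minimal sketch of the separate-device layout GW describes further down; the paths, ports and hostnames are illustrative only, not taken from this thread.)

    # zoo.cfg -- illustrative values
    tickTime=2000
    initLimit=10
    syncLimit=5
    clientPort=2181
    # snapshots go here
    dataDir=/var/lib/zookeeper/data
    # transaction log on its own (ideally dedicated) disk
    dataLogDir=/mnt/zk-txlog
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888

Solr's own update log (tlog), by contrast, lives under each core's data directory and is a separate thing from ZooKeeper's transaction log.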
On Tue, Jun 6, 2017 at 5:23 AM, GW <thegeofo...@gmail.com> wrote:

> I've heard of systems tanking like this on Windows during OS updates.
> Because of this, I run all my updates in attendance even though I'm Linux.
> My nodes run as VMs: I shut down Solr gracefully, snapshot a backup of
> the VM, update and run. If things go screwy I can always roll back. To me
> it sounds like a lack of resources or a kink in your networking, assuming
> your setup is correct. Watch for homemade network cables. I've seen soft
> crimp connectors put on solid wire, which can wreck a switch port forever.
>
> Do you have a separate transaction log device on each Zookeeper? I made
> this mistake in the beginning and had similar problems under load.
>
> GW
>
> On 5 June 2017 at 22:32, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > bq: This means that technically the replica nodes should not fall behind
> > and do not have to go into recovery mode
> >
> > Well, true if nothing weird happens. By "weird" I mean anything that
> > interferes with the leader getting anything other than a success code
> > back from a follower it sends documents to.
> >
> > bq: Is this the only scenario in which a node can go into recovery status?
> >
> > No, there are others. One for-instance: the leader sends a doc to the
> > follower and the request times out (huge GC pauses, the doc takes too
> > long to index for whatever reason, etc.). The leader then sends a
> > message to the follower to go directly into the recovery state, since
> > the leader has no way of knowing whether the follower successfully
> > wrote the document to its transaction log. You'll see messages about
> > "leader initiated recovery" in the follower's Solr log in this case.
> >
> > Two bits of pedantry:
> >
> > bq: Down by the other replicas
> >
> > Almost. We're talking indexing here, and IIUC only the leader can send
> > another node into recovery, as all updates go through the leader.
> >
> > If I'm going to be nit-picky, Zookeeper can _also_ cause a node to be
> > marked as down if its periodic ping of the node fails to return.
> > Actually I think this is done through another Solr node that ZK
> > notifies....
> >
> > bq: It goes into a recovery mode and tries to recover all the
> > documents from the leader of shard1.
> >
> > Also nit-picky. But if the follower isn't "too far" behind it can be
> > brought back into sync via "peer sync", where it gets the missed
> > docs sent to it from the tlog of a healthy replica. "Too far" is 100
> > docs by default, but can be set in solrconfig.xml if necessary. If
> > that limit is exceeded, then indeed the entire index is copied from
> > the leader.
> >
> > Best,
> > Erick
> >
> > On Mon, Jun 5, 2017 at 5:18 PM, suresh pendap <sureshfors...@gmail.com>
> > wrote:
> > > Hi,
> > >
> > > Why and in what scenarios do Solr nodes go into recovery status?
> > >
> > > Given that Solr is a CP system, writes for a document are acknowledged
> > > only after they are propagated to and acknowledged by all the replicas
> > > of the shard.
> > >
> > > This means that technically the replica nodes should not fall behind
> > > and do not have to go into recovery mode.
> > >
> > > Is my above understanding correct?
> > >
> > > Can the scenario below happen?
> > >
> > > 1. Assume that we have 3 replicas for shard shard1, with the names
> > > shard1_replica1, shard1_replica2 and shard1_replica3.
> > >
> > > 2. Due to some reason, a network issue or something else, shard1_replica2
> > > is not reachable by the other replicas and it is marked as Down by the
> > > other replicas (shard1_replica1 and shard1_replica3 in this case).
> > >
> > > 3. The network issue is resolved and shard1_replica2 is reachable
> > > again. It goes into recovery mode and tries to recover all the
> > > documents from the leader of shard1.
> > >
> > > Is this the only scenario in which a node can go into recovery status?
> > >
> > > In other words, does the node have to go into a Down status before
> > > getting back into a recovery status?
> > >
> > > Regards
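(Editor's note on Erick's "100 docs by default" remark: as far as I know, the peer-sync window is governed by numRecordsToKeep on the <updateLog> section of solrconfig.xml, with maxNumLogsToKeep capping how many tlog files are retained. A rough sketch, with example values rather than recommendations:)

    <!-- solrconfig.xml, inside <updateHandler>; values are illustrative -->
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <!-- how many recent docs a healthy replica can replay to a peer -->
      <int name="numRecordsToKeep">1000</int>
      <!-- how many tlog files to retain -->
      <int name="maxNumLogsToKeep">20</int>
    </updateLog>

If a returning replica is further behind than this window, it falls back to full index replication from the leader, as Erick describes. A replica's progress through the down / recovering / active states can be watched with the Collections API CLUSTERSTATUS action (e.g. /admin/collections?action=CLUSTERSTATUS).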