I've heard of systems tanking like this on Windows during OS updates.
Because of this, I run all my updates attended, even though I'm on Linux.
My nodes run as VMs: I shut down Solr gracefully, snapshot a backup of
the VM, then update and run. If things go screwy I can always roll back. To me
it sounds like a lack of resources or a kink in your networking, assuming
your setup is correct. Watch for home-made network cables; I've seen soft
crimp connectors put on solid wire, which can wreck a switch port forever.
Do you have a separate transaction log device on each ZooKeeper node? I made
this mistake in the beginning and had similar problems under load.
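
For reference, the setting is dataLogDir in zoo.cfg. A minimal sketch;
the paths here are only illustrative, point dataLogDir at whatever
dedicated device you have:

  # zoo.cfg
  dataDir=/var/lib/zookeeper/data      # snapshots; ordinary disk is fine
  dataLogDir=/zk-txlog/zookeeper       # transaction log; dedicated device

ZooKeeper fsyncs the transaction log on every write, so sharing that
device with anything busy causes latency spikes under load.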


GW

On 5 June 2017 at 22:32, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: This means that technically the replica nodes should not fall
> behind and do not have to go into recovery mode
>
> Well, true if nothing weird happens. By "weird" I mean anything that
> interferes with the leader getting a success code back from a
> follower it sends a document to.
>
> bq: Is this the only scenario in which a node can go into recovery status?
>
> No, there are others. One for-instance: the leader sends a doc to the
> follower and the request times out (huge GC pauses, the doc takes too
> long to index for whatever reason, etc.). The leader then sends a
> message to the follower to go directly into the recovery state, since
> the leader has no way of knowing whether the follower successfully
> wrote the document to its transaction log. You'll see messages about
> "leader initiated recovery" in the follower's Solr log in this case.
>
> Two bits of pedantry:
>
> bq: Down by the other replicas
>
> Almost. We're talking indexing here, and IIUC only the leader can send
> another node into recovery, as all updates go through the leader.
>
> If I'm going to be nit-picky, ZooKeeper can _also_ cause a node to be
> marked as down if its periodic ping of the node fails to return.
> Actually I think this is done through another Solr node that ZK
> notifies....
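>
> The knob for that session is zkClientTimeout in solr.xml (or the
> ZK_CLIENT_TIMEOUT variable in solr.in.sh); if a stop-the-world GC
> pause outlives it, ZK expires the node's ephemeral node and the node
> gets marked down. Roughly (the default differs across versions):
>
>   <solrcloud>
>     <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
>   </solrcloud>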
>
> bq: It goes into a recovery mode and tries to recover all the
> documents from the leader of shard1.
>
> Also nit-picky, but if the follower isn't "too far" behind it can be
> brought back into sync via "peer sync", where it gets the missed
> docs sent to it from the tlog of a healthy replica. "Too far" is 100
> docs by default, but can be set in solrconfig.xml if necessary. If
> that limit is exceeded, then indeed the entire index is copied from
> the leader.
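>
> That threshold is the update log's numRecordsToKeep setting, which you
> can raise in solrconfig.xml if your followers routinely miss more than
> 100 updates during a hiccup. A sketch (500 is only an example value):
>
>   <updateLog>
>     <str name="dir">${solr.ulog.dir:}</str>
>     <int name="numRecordsToKeep">500</int>
>   </updateLog>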
>
> Best,
> Erick
>
>
>
> On Mon, Jun 5, 2017 at 5:18 PM, suresh pendap <sureshfors...@gmail.com>
> wrote:
> > Hi,
> >
> > Why and in what scenarios do Solr nodes go into recovery status?
> >
> > Given that Solr is a CP system, writes for a document are acknowledged
> > only after they are propagated to and acknowledged by all the replicas
> > of the shard.
> >
> > This means that, technically, the replica nodes should never fall
> > behind and should not have to go into recovery mode.
> >
> > Is my above understanding correct?
> >
> > Can the scenario below happen?
> >
> > 1. Assume that we have 3 replicas for Shard shard1 with the names
> > shard1_replica1, shard1_replica2 and shard1_replica3.
> >
> > 2. For some reason (a network issue or something else), shard1_replica2
> > is not reachable by the other replicas and is marked as Down by them
> > (shard1_replica1 and shard1_replica3 in this case).
> >
> > 3. The network issue is resolved and shard1_replica2 is reachable
> > again. It goes into recovery mode and tries to recover all the
> > documents from the leader of shard1.
> >
> > Is this the only scenario in which a node can go into recovery status?
> >
> > In other words, does the node have to go into a Down status before
> > getting back into a recovery status?
> >
> >
> > Regards
>
