Hi Erick,

I observed that the update request rate dropped from 20 per second to 3 per second for about 8 minutes, followed by a huge burst of updates. This closely matches the queue-up behavior you mentioned, but I did not expect the timeout to take that long. Is there a configurable setting for the timeout?

Also, the bad tlog replica was not reachable at the time, so we issued a DELETEREPLICA command through the Collections API to remove it from the cloud.
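For reference, a minimal sketch of how such a DELETEREPLICA request can be built. The host, collection, shard, and replica names below are placeholders for illustration, not the real values from our cluster, and the helper name is hypothetical. The snippet only builds the URL and does not send it.

```python
from urllib.parse import urlencode


def delete_replica_url(base_url, collection, shard, replica):
    """Build a Collections API DELETEREPLICA URL (nothing is sent)."""
    params = {
        "action": "DELETEREPLICA",
        "collection": collection,  # collection that owns the replica
        "shard": shard,            # shard the replica belongs to
        "replica": replica,        # replica name, e.g. a core_node name
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Placeholder values only; substitute the real names from the cluster state.
url = delete_replica_url(
    "http://localhost:8983/solr", "mycollection", "shard1", "core_node5"
)
print(url)
```

The resulting URL can then be issued with any HTTP client against a live node.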
Thanks,
Wei

On Tue, Nov 19, 2019 at 5:52 AM Erick Erickson <erickerick...@gmail.com> wrote:

> How long are updates blocked and how did the tlog replica on the bad
> hardware go down?
>
> Solr has to wait for an ack back from the tlog follower to be certain that
> the follower has all the documents in case it has to switch to that replica
> to become the leader. If the update to the follower times out, the leader
> will put it into a recovering state.
>
> So I’d expect the collection to queue up indexing until the request to the
> follower on the bad hardware timed out, did you wait at least that long?
>
> Best,
> Erick
>
> > On Nov 18, 2019, at 7:11 PM, Wei <weiwan...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am puzzled by a problem in solr cloud with Tlog replicas and would
> > appreciate your insights. Our solr cloud has two shards and each shard
> > have 5 tlog replicas. When one of the non-leader replica has hardware
> > issue and become unreachable, updates to the whole cloud stopped. We are
> > on solr 7.6 and use solrj client to send updates only to leaders. To my
> > understanding, with Tlog replica type, the leader only forward update
> > requests to replicas for transaction log update and each replica
> > periodically pulls the segment from leader. When one replica fails to
> > respond, why update requests to the cloud are blocked? Does leader need
> > to wait for response from each replica to inform client that update is
> > successful?
> >
> > Best,
> > Wei