Update with another observation: after the follower replica became unresponsive, I noticed multiple commits on the leader within two minutes, followed by this OOM error on the leader:
o.a.s.s.HttpSolrCall null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Direct buffer memory
    at org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:662)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:530)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at ....

The commits are not in line with our autocommit interval. I am wondering whether they could be caused by a leader-initiated recovery process. Will the tlog leader do extra commits so that the replica can sync up during recovery?

Best,
Wei

On Tue, Nov 19, 2019 at 1:22 PM Wei <weiwan...@gmail.com> wrote:

> Hi Erick,
>
> I observed that the update request rate dropped from 20 per sec to 3 per
> sec for about 8 minutes. After that there is a huge burst of updates. This
> looks quite match the queue up behavior you mentioned. But I don't think
> the time out took that long. Is there a configurable setting for the time
> out?
> Also the bad tlog replica is not reachable at the time, so we did a
> DELETEREPLICA command with collections API to remove it from the cloud.
>
> Thanks,
> Wei
>
>
> On Tue, Nov 19, 2019 at 5:52 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> How long are updates blocked and how did the tlog replica on the bad
>> hardware go down?
>>
>> Solr has to wait for an ack back from the tlog follower to be certain
>> that the follower has all the documents in case it has to switch to that
>> replica to become the leader. If the update to the follower times out, the
>> leader will put it into a recovering state.
>>
>> So I’d expect the collection to queue up indexing until the request to
>> the follower on the bad hardware timed out, did you wait at least that long?
>>
>> Best,
>> Erick
>>
>> > On Nov 18, 2019, at 7:11 PM, Wei <weiwan...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I am puzzled by a problem in solr cloud with Tlog replicas and would
>> > appreciate your insights. Our solr cloud has two shards and each shard
>> > have 5 tlog replicas. When one of the non-leader replica has hardware issue
>> > and become unreachable, updates to the whole cloud stopped. We are on
>> > solr 7.6 and use solrj client to send updates only to leaders. To my
>> > understanding, with Tlog replica type, the leader only forward update
>> > requests to replicas for transaction log update and each replica
>> > periodically pulls the segment from leader. When one replica fails to
>> > respond, why update requests to the cloud are blocked? Does leader need
>> > to wait for response from each replica to inform client that update is
>> > successful?
>> >
>> > Best,
>> > Wei