[ https://issues.apache.org/jira/browse/SOLR-13896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992933#comment-16992933 ]
ASF subversion and git services commented on SOLR-13896:
--------------------------------------------------------

Commit c4f0c3363828c088eefa2b99783178848c2f1f7a in lucene-solr's branch refs/heads/master from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c4f0c33 ]

SOLR-13975, SOLR-13896: ConcurrentUpdateSolrClient connection stall prevention.

> Paused a non-leader node can cause recovery on other nodes
> ----------------------------------------------------------
>
>                 Key: SOLR-13896
>                 URL: https://issues.apache.org/jira/browse/SOLR-13896
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Cao Manh Dat
>            Assignee: Andrzej Bialecki
>            Priority: Major
>         Attachments: SOLR-13896.patch
>
>
> All stack traces below are based on the 7.5 branch. This problem still exists on the 8.x
> branches. Here is the scenario: we have 3 replicas
> * L: the leader replica
> * R: the normal replica
> * P: the poor one, which was paused and then resumed
> L is trying to send data to R and P. During that, P gets paused. Here is what
> happens in L's threads.
> * Thread 1 is stuck at this line of StreamingSolrClients
> {code:java}
> public synchronized void blockUntilFinished() {
>   for (ConcurrentUpdateSolrClient client : solrClients.values()) {
>     client.blockUntilFinished();
>   }
> }
> {code}
> Basically, this thread is waiting for the other sender threads to finish.
> Let's assume that this is the content of *solrClients.values(): [clientToP,
> clientToR]*
> * Thread 2 corresponds to *clientToP*. Since P is paused, it doesn't close
> the connection; it just keeps the connection open and never returns any data back
> to L.
> So this thread is stuck with this stack trace, waiting for response data
> from *P* (with timeout=600000ms). Therefore it causes Thread 1 to be stuck at
> *clientToP.blockUntilFinished()*
> {code:java}
> java.lang.Thread.State: RUNNABLE
>   at java.net.SocketInputStream.socketRead0(Native Method)
>   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>   at java.net.SocketInputStream.read(SocketInputStream.java:171)
>   at java.net.SocketInputStream.read(SocketInputStream.java:141)
>   at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
>   at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
>   at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
>   at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
>   at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
>   at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
>   at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
>   at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
>   at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
>   at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
>   at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
>   at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
>   at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
>   at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
>   at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
>   at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:347)
> {code}
> * Since *clientToR* is the second element of the array, it never gets
> called (or at least not until after the timeout). This problem causes Thread 3 to be stuck
> at this line
> {code:java}
> upd = queue.poll(pollQueueTime, TimeUnit.MILLISECONDS);
> {code}
> Note that pollQueueTime == Integer.MAX_VALUE (this is set by
> StreamingSolrClients). Therefore, unless clientToR.blockUntilFinished() is
> called (which interrupts Thread 3), Thread 3 will be stuck at the above line
> forever.
> * Because *clientToR* is sending data to R but never closes the output stream,
> R just waits forever (until it times out 120000ms later).
> Which then leads to this exception
> {code:java}
> o.a.s.h.RequestHandlerBase java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 120003/120000 ms
>   at org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1080)
>   at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:313)
>   at org.apache.solr.servlet.ServletInputStreamWrapper.read(ServletInputStreamWrapper.java:74)
>   at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:100)
>   at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
>   at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
>   at org.apache.solr.common.util.FastInputStream.peek(FastInputStream.java:60)
>   at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:107)
>   at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
> {code}
> * After that, the leader puts all replicas, including the non-paused one, into
> recovery.
>
> It is a very bad outcome, and this is not just a theoretical problem, since some
> cloud platforms can freeze a node when doing maintenance.
> Thanks [~ab] and [~shalin] for helping me debug this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
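
The chain of waits described in the issue can be reproduced in miniature without Solr. What follows is only a sketch under stated assumptions, not the code from the commit: StallSketch, Client and sequentialWaitLeavesROpen are hypothetical names, a latch stands in for the HTTP response, and a short wait stands in for the 600000ms socket timeout. It shows that while the caller is stuck waiting on the paused replica's client, the sender for the healthy replica keeps its stream open when the poll time is Integer.MAX_VALUE, but closes it on its own when the poll time is bounded.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the stall; none of these types are Solr classes.
class StallSketch {

    /** Stand-in for one per-replica client with a single sender thread. */
    static final class Client {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Released when the (simulated) replica answers the request.
        final CountDownLatch response = new CountDownLatch(1);
        final Thread runner;
        volatile boolean streamClosed = false;

        Client(long pollQueueTimeMs, boolean replicaPaused) {
            runner = new Thread(() -> {
                try {
                    // Keep the request stream open for as long as poll() yields updates.
                    while (queue.poll(pollQueueTimeMs, TimeUnit.MILLISECONDS) != null) { }
                } catch (InterruptedException e) {
                    // Interrupted by blockUntilFinished(): fall through and close the stream.
                }
                streamClosed = true;
                if (!replicaPaused) {
                    response.countDown(); // a paused replica never responds
                }
            });
            runner.setDaemon(true);
            runner.start();
        }

        /** Interrupts the sender, then waits up to waitMs for the response. */
        boolean blockUntilFinished(long waitMs) {
            runner.interrupt();
            try {
                return response.await(waitMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /**
     * Waits on clientToP first and clientToR second, like the loop quoted in the issue.
     * Returns true if R's stream was still open after the whole wait on P.
     */
    static boolean sequentialWaitLeavesROpen(long pollQueueTimeMs, long waitOnPausedMs) {
        Client toP = new Client(pollQueueTimeMs, true);
        Client toR = new Client(pollQueueTimeMs, false);
        toP.queue.add("doc1");
        toR.queue.add("doc1");

        toP.blockUntilFinished(waitOnPausedMs); // P is paused, so this uses the full wait
        boolean rStillOpen = !toR.streamClosed;
        toR.blockUntilFinished(1000);           // only now is R's sender interrupted
        return rStillOpen;
    }

    public static void main(String[] args) {
        System.out.println("unbounded poll: " + sequentialWaitLeavesROpen(Integer.MAX_VALUE, 400));
        System.out.println("bounded poll:   " + sequentialWaitLeavesROpen(25, 400));
    }
}
```

In this model, an unbounded poll time leaves R's stream open for the entire wait on P, which corresponds to R hitting its idle timeout in the issue. A bounded poll time lets R's sender finish independently of the order in which the clients are waited on. The 25ms and 400ms values are arbitrary choices for the sketch.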