SolrCloud 6.6.2 suddenly crashes due to slow queries and Log4j issue
Hi,

A few months ago, I reported an issue with Solr nodes crashing due to the old generation heap growing suddenly and generating an OOM. This problem occurred again this week. I have thread dumps for each minute during the 3 minutes the problem occurred, and I am using fastthread.io to analyse these dumps. The thread scenario on the failing node is:

=== 15h54 -> it looks fine
Old gen heap: 0.5 GB (3 GB max)
67 threads TIMED_WAITING
26 threads RUNNABLE
7 threads WAITING

=== 15h55 -> fastthread.io reports a few suspects
Old gen heap starts growing: from 0.5 GB to 2 GB (3 GB max)
42 threads TIMED_WAITING
41 threads RUNNABLE
10 threads WAITING

The first symptom is that 8 RUNNABLE threads are stuck (same stack trace) waiting for a response from some other nodes (see the timeout sketch after the dump details below):

java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
    at org.apache.http.impl.io.SocketInputBuffer.isDataAvailable(SocketInputBuffer.java:95)
    at org.apache.http.impl.AbstractHttpClientConnection.isStale(AbstractHttpClientConnection.java:310)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.isStale(ManagedClientConnectionImpl.java:158)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:433)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:515)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
    at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:447)
    at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:388)
    at org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:302)
    at org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:166)
    at org.apache.solr.handler.component.HttpShardHandler$$Lambda$192/1788637481.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$15/986729174.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

=== 15h56 -> fastthread.io reports the issue
Old gen heap full: 3 GB (3 GB max)
57 threads TIMED_WAITING
126 threads RUNNABLE
18 threads WAITING
14 threads BLOCKED

7 RUNNABLE threads are still stuck (same stack trace) waiting for a response from some other nodes.
1 BLOCKED thread obtained the org.apache.log4j.Logger lock and did not release it; because of that, 13 threads are BLOCKED (same stack trace) on org.apache.log4j.Category.callAppenders:

java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.log4j.Category.callAppenders(Category.java:204)
    - waiting to lock <0x0007005a6f08> (a org.apache.log4j.Logger)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:304)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2482)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.C
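To make the log4j part concrete, here is a minimal standalone sketch I put together (my own illustration, not Solr code; the class name, appender delay and thread counts are invented). In log4j 1.2, Category.callAppenders() synchronizes on the Logger, so one slow appender call is enough to leave every other thread that logs through the same logger BLOCKED on that monitor, which matches the 13 BLOCKED threads above.

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Logger;
import org.apache.log4j.spi.LoggingEvent;

// Minimal sketch (not Solr code): log4j 1.2's Category.callAppenders()
// synchronizes on the Logger, so a single slow appender call serializes
// every thread that logs through the same logger.
public class Log4j1ContentionDemo {
    static final Logger LOG = Logger.getLogger(Log4j1ContentionDemo.class);

    public static void main(String[] args) {
        LOG.addAppender(new AppenderSkeleton() {
            @Override
            protected void append(LoggingEvent event) {
                try {
                    Thread.sleep(200); // simulate a slow appender (e.g. blocked disk I/O)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            @Override public void close() {}
            @Override public boolean requiresLayout() { return false; }
        });

        // Many "request" threads all logging through the same Logger.
        for (int i = 0; i < 20; i++) {
            new Thread(() -> {
                for (int j = 0; j < 100; j++) {
                    LOG.info("simulated request logging");
                }
            }, "request-" + i).start();
        }
    }
}

A thread dump taken while this runs shows the same picture as the 15h56 dump: one thread RUNNABLE inside callAppenders() and the others BLOCKED waiting to lock the org.apache.log4j.Logger.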
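About the 8 threads stuck in socketRead0 at 15h55: as far as I understand, a read with no (or a very large) socket timeout can wait indefinitely for a node that has stopped answering. Below is a rough SolrJ 6.x sketch of the kind of client-side timeouts that bound this wait (host, collection and values are invented, not our real settings); for the internal shard requests, the equivalent knobs would be the shardHandlerFactory socketTimeout / connTimeout settings in solr.xml.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Sketch only (host, collection and timeout values are placeholders): a SolrJ
// 6.x client with explicit timeouts, so a thread running query() cannot sit in
// SocketInputStream.socketRead0 forever when the remote node stops answering.
public class TimeoutClientSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder()
                .withBaseSolrUrl("http://solr-node1:8983/solr/mycollection")
                .build();
        client.setConnectionTimeout(5000); // ms allowed to establish the TCP connection
        client.setSoTimeout(30000);        // ms read timeout while waiting for the response

        try {
            client.query(new SolrQuery("*:*"));
        } finally {
            client.close();
        }
    }
}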
Re: SolrCloud 6.6.2 suddenly crashes due to slow queries and Log4j issue
On 10/18/2020 3:22 AM, Dominique Bejean wrote:
> A few months ago, I reported an issue with Solr nodes crashing due to
> the old generation heap growing suddenly and generating an OOM. This
> problem occurred again this week. I have thread dumps for each minute
> during the 3 minutes the problem occurred. I am using fastthread.io in
> order to analyse these dumps.
> * The Log4j issue starts
> (https://blog.fastthread.io/2020/01/24/log4j-bug-slows-down-your-app/)

If the log4j bug is the root cause here, then the only way you can fix this is to upgrade to at least Solr 7.4. That is the Solr version where we first upgraded from log4j 1.2.x to log4j2. You cannot upgrade log4j in Solr 6.6.2 without changing Solr code, and the code changes required were extensive. Note that I did not do anything to confirm whether the log4j bug is responsible here; you seem pretty confident that this is the case.

Note that if you upgrade to 8.x, you will need to reindex from scratch. Upgrading an existing index is possible with one major version bump, but if your index has ever been touched by a release that's two major versions back, it won't work. In 8.x, that is enforced -- 8.x will not even try to read an old index touched by 6.x or earlier.

In the following wiki page, I provided instructions for getting a screenshot of the process listing:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

In addition to that screenshot, I would like to know the on-disk size of all the cores running on the problem node, along with a document count from those cores.

It might be possible to work around the OOM just by increasing the size of the heap. That won't do anything about problems with log4j.

Thanks,
Shawn
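P.S. If it is easier than collecting those numbers by hand, a small SolrJ sketch along these lines (the node URL is a placeholder, and I have not run it against your setup) should print each core's document count and on-disk index size from the CoreAdmin STATUS response:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;
import org.apache.solr.common.util.NamedList;

// Rough sketch: print numDocs and on-disk index size for every core on a node,
// taken from the CoreAdmin STATUS output. Node URL is a placeholder.
public class CoreStatsSketch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder()
                .withBaseSolrUrl("http://solr-node1:8983/solr").build()) {
            CoreAdminResponse status = CoreAdminRequest.getStatus(null, client); // null = all cores
            NamedList<NamedList<Object>> cores = status.getCoreStatus();
            for (int i = 0; i < cores.size(); i++) {
                @SuppressWarnings("unchecked")
                NamedList<Object> index = (NamedList<Object>) cores.getVal(i).get("index");
                System.out.println(cores.getName(i)
                        + "  numDocs=" + index.get("numDocs")
                        + "  sizeInBytes=" + index.get("sizeInBytes"));
            }
        }
    }
}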