Mikhail,

Thank you for your response.

--- For the logs

On the non-leader replicas, there are no errors in the log, only WARN entries due to slow queries.

On the leader replicas, there are these errors:

* Twice per minute, all day before the problem starts and also after it starts:

RequestHandlerBase org.apache.solr.common.SolrException: Collection: xxxxxx not found

where xxxxxx is the alias name pointing to the collection.

* Just after the problem starts:

2020-05-13 15:24:41.450 ERROR (qtp1682092198-315202) [c:xxxxxx_2 s:shard3 r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[ http://XXXXXX127:8983/solr/xxxxxx_2_shard1_replica1, http://XXXXXX132:8983/solr/xxxxxx_2_shard2_replica0]

2020-05-13 15:24:41.451 ERROR (qtp1682092198-315202) [c:xxxxxx_2 s:shard3 r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[ http://XXXXXX127:8983/solr/xxxxxx_2_shard1_replica1, http://XXXXXX132:8983/solr/xxxxxx_2_shard2_replica0]

2020-05-13 15:25:49.642 ERROR (qtp1682092198-315193) [c:xxxxxx_2 s:shard3 r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.s.HttpSolrCall null:java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 51815/50000 ms

and later, until the JVM hangs:

2020-05-13 15:58:54.397 ERROR (qtp1682092198-316314) [c:xxxxxx_2 s:shard3 r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: no servers hosting shard: xxxxxx_2_shard2

There are no OOM errors in the Solr logs, only the OOM killer script log:

Running OOM killer script for process 4488 for Solr on port 8983
Killed process 4488

--- For the heap dump

I have a dump for one shard leader, taken just before the OOM script killed the JVM but more than one hour after the problem started.
I will take a look.

Regards.

Dominique

On Sun, May 17, 2020 at 20:22, Mikhail Khludnev <m...@apache.org> wrote:

> Hello, Dominique.
> What did it log? Which exception?
> Do you have a chance to review the heap dump? What consumed the whole heap?
>
> On Sun, May 17, 2020 at 11:05 AM Dominique Bejean <
> dominique.bej...@eolya.fr> wrote:
>
>> Hi,
>>
>> I have a six-node SolrCloud whose six nodes suddenly all failed with OOM
>> at the same time.
>> This can happen even when the SolrCloud is not under heavy load and there
>> is no indexing.
>>
>> I do not see any reason for this to happen. Here is the description of
>> the issue. Thank you for your suggestions and advice.
>>
>>
>> One or two hours before the nodes stop with OOM, we see this scenario on
>> all six nodes during the same five-minute time frame:
>> * a little more young GC: from one per second (duration < 0.05 sec) to
>> one every two or three seconds (duration < 0.15 sec)
>> * full GC starts to occur every 5 sec with 0 bytes reclaimed
>> * young GC starts to reclaim fewer bytes
>> * long full GCs start to reclaim bytes, but fewer and fewer each time
>> * then no more young GC
>> Here are the GC graphs: https://www.eolya.fr/solr_issue_gc.png
>>
>>
>> Just before the problem occurs:
>> * there are no more requests per second than usual
>> * no update/commit/merge
>> * CPU usage and load are low
>> * disk I/O is low
>> After the problem starts, requests become longer and longer, but there is
>> still no increase in CPU usage or disk I/O.
>>
>>
>> During the last issue, we dumped the threads on one node just before OOM but
>> unfortunately, more than one hour after the problem started.
>> 85% of threads (more than 3000) are BLOCKED and related to log4j.
>> Solr is either trying to log a slow query or trying to log problems in
>> the request handler:
>> at org.apache.solr.common.SolrException.log(SolrException.java:148)
>> at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:204)
>>
>> This high count of BLOCKED threads is more a consequence than a cause. We
>> will dump threads each minute until the next issue.
>>
>>
>> About the Solr environment:
>> * Solr 6.6
>> * Java Oracle 1.8.0_112 25.112-b15
>>
>> * 1 collection with 10 million small documents
>> * 3 shards x 2 replicas
>> * 3.5 million docs per core
>> * 90 GB index size per core
>>
>> * Server with 6 processors and 90 GB of RAM
>> * Swappiness set to 1, nearly no swap used
>> * 4 GB heap, used between roughly 25% and 60% before young GC, with one
>> full GC (3 seconds) every 15 to 30 minutes when all is fine
>>
>> * Default JVM settings with CMS GC
>> * JMX enabled
>> * Average requests per second at peak on one core: 170, but during the
>> last issue the average requests per second was 30!
>> * Average time per request: < 30 ms
>>
>> About updates:
>> * Very few adds/updates in general
>> * Some deleteByQuery (nearly 2000 per day) but not before the problem
>> occurs
>> * autocommit maxTime: 15000 ms
>>
>> About queries:
>> * Queries are standard queries or suggesters
>> * Queries generate facets, but there are no fields with a very high number
>> of unique values
>> * No grouping
>> * Heavy use of function queries for relevance computation
>>
>>
>> Thank you.
>>
>> Dominique
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
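
For the per-minute thread dumps mentioned in the quoted message, a small script can track the share of BLOCKED threads from one dump to the next. This is only a minimal sketch: it assumes the dumps are plain-text jstack-style output containing "java.lang.Thread.State: ..." lines (the thread does not say which tool produced the dumps), and the function names and the sample dump below are made up for illustration.

```python
import re
from collections import Counter

# Matches the state line that jstack-style dumps print under each thread header.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")

# Synthetic sample, not taken from the real dumps discussed in this thread.
SAMPLE_DUMP = """\
"example-thread-1" #1 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"example-thread-2" #2 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"example-thread-3" #3 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"example-thread-4" #4 prio=5
   java.lang.Thread.State: RUNNABLE
"""


def count_thread_states(dump_text):
    """Return a Counter mapping each thread state to its number of threads."""
    return Counter(STATE_RE.findall(dump_text))


def blocked_ratio(dump_text):
    """Return the fraction of threads in state BLOCKED (0.0 for an empty dump)."""
    counts = count_thread_states(dump_text)
    total = sum(counts.values())
    return counts["BLOCKED"] / total if total else 0.0


if __name__ == "__main__":
    print(count_thread_states(SAMPLE_DUMP))
    print(blocked_ratio(SAMPLE_DUMP))
```

Running this over each dump in time order would show when the BLOCKED share starts to climb relative to the first full GCs with 0 bytes reclaimed, which may help separate cause from consequence.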