Hi,

I have a six-node SolrCloud cluster whose six nodes all suddenly failed with
OOM at the same time.
This can happen even when the SolrCloud cluster is not under heavy load and
there is no indexing.

I do not see any reason for this to happen. Here is a description of the
issue. Thank you for your suggestions and advice.


One or two hours before the nodes stop with OOM, we see this sequence on
all six nodes within the same five-minute time frame:
* young GCs become less frequent but longer: from one per second
(duration < 0.05 s) to one every two or three seconds (duration < 0.15 s)
* full GCs start occurring every 5 seconds with 0 bytes reclaimed
* young GCs start reclaiming fewer bytes
* long full GCs start reclaiming bytes, but fewer and fewer each time
* then no more young GCs
Here are the GC graphs: https://www.eolya.fr/solr_issue_gc.png
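
The full GCs that reclaim 0 bytes could be picked out of a GC log with
something like the sketch below. It is only an illustration: it assumes
Java 8 style log lines that print heap occupancy as
<before>K-><after>K(<capacity>K), and it only reads the first such figure on
each "Full GC" line. The function name and the sample lines are hypothetical,
not taken from our logs.

```python
import re

# Matches "<before>K-><after>K(<capacity>K)" (assumed Java 8 GC log style).
OCCUPANCY = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")


def fruitless_full_gcs(lines, min_reclaimed_kb=1):
    """Return (line_number, before_kb, after_kb) for each full GC that
    reclaimed less than min_reclaimed_kb kilobytes."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if "Full GC" not in line:
            continue  # young collections and other log lines are skipped
        match = OCCUPANCY.search(line)  # first occupancy figure on the line
        if match is None:
            continue
        before, after = int(match.group(1)), int(match.group(2))
        if before - after < min_reclaimed_kb:
            hits.append((number, before, after))
    return hits


# Hypothetical sample lines, for illustration only.
sample = [
    "10.5: [GC (Allocation Failure) 1200000K->400000K(4194304K), 0.0400000 secs]",
    "15.0: [Full GC (Allocation Failure) 4194303K->4194303K(4194304K), 3.1000000 secs]",
    "20.0: [Full GC (Allocation Failure) 4194303K->4100000K(4194304K), 3.0000000 secs]",
]
fruitless = fruitless_full_gcs(sample)
print(fruitless)
```

Only the second sample line is reported, since the third one does reclaim
some memory.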


Just before the problem occurs:
* the number of requests per second does not increase
* no update/commit/merge
* CPU usage and load are low
* disk I/O is low
After the problem starts, requests take longer and longer, but there is
still no increase in CPU usage or disk I/O.


During the last incident, we dumped the threads on one node just before the
OOM but, unfortunately, more than one hour after the problem started.
85% of the threads (more than 3000) are BLOCKED and related to log4j.
Solr is either trying to log a slow query or trying to log a problem in the
request handler:
at org.apache.solr.common.SolrException.log(SolrException.java:148)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:204)

This high count of BLOCKED threads is more likely a consequence than a
cause. We will dump threads every minute until the next occurrence.
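
To compare the dumps from one minute to the next, something like the
following could count thread states in each one. This is only a minimal
sketch: it assumes each dump is the text output of jstack <pid>, and
count_thread_states, blocked_ratio and the sample dump are hypothetical names
and data, not taken from our system.

```python
import re
from collections import Counter

# jstack prints one "java.lang.Thread.State: <STATE>" line per Java thread.
STATE = re.compile(r"^\s*java\.lang\.Thread\.State: (\w+)")


def count_thread_states(dump_text):
    """Count threads per state in the text of one thread dump."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = STATE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def blocked_ratio(counts):
    """Fraction of counted threads that are BLOCKED (0.0 for an empty dump)."""
    total = sum(counts.values())
    return counts["BLOCKED"] / total if total else 0.0


# Hypothetical, heavily shortened dump, for illustration only.
sample_dump = """
"qtp-1" #101 prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.solr.common.SolrException.log(SolrException.java:148)

"qtp-2" #102 prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)

"qtp-3" #103 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
"""
counts = count_thread_states(sample_dump)
print(counts, round(blocked_ratio(counts), 2))
```

Running this on each per-minute dump should show when the BLOCKED share
starts to rise relative to the first GC anomalies.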


About the Solr environment:
* Solr 6.6
* Oracle Java 1.8.0_112 (25.112-b15)

* 1 collection with 10 million small documents
* 3 shards x 2 replicas
* 3.5 million docs per core
* 90 GB index size per core

* Server with 6 processors and 90 GB of RAM
* Swappiness set to 1, almost no swap used
* 4 GB heap, usually between 25% and 60% used before a young GC, with one
full GC (3 seconds) every 15 to 30 minutes when all is fine.

* Default JVM settings with CMS GC
* JMX enabled
* Average requests per second at peak on one core: 170, but during the last
incident the average was only 30!
* Average time per request: < 30 ms

About updates:
* Very few adds/updates in general
* Some deleteByQuery (nearly 2000 per day), but none just before the problem
occurs
* autocommit maxTime: 15000 ms

About queries:
* Queries are standard queries or suggesters
* Queries generate facets, but no field has a very high number of unique
values
* No grouping
* Heavy use of function queries for relevance computation


Thank you.

Dominique
