Re: Solrcloud 6.6 becomes nuts

Dominique Bejean Sun, 17 May 2020 15:19:25 -0700

Mikhail,


Thank you for your response.


--- For the logs

On the non-leader replicas, there are no errors in the log, only WARN entries
due to slow queries.

On the leader replicas, there are these errors:

* Twice per minute throughout the day before the problem starts, and also
after the problem starts
RequestHandlerBase org.apache.solr.common.SolrException: Collection: xxxxxx
not found
where xxxxxx is the alias name pointing to the collection

* Just after the problem starts
2020-05-13 15:24:41.450 ERROR (qtp1682092198-315202) [c:xxxxxx_2 s:shard3
r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request:[
http://XXXXXX127:8983/solr/xxxxxx_2_shard1_replica1,
http://XXXXXX132:8983/solr/xxxxxx_2_shard2_replica0]
2020-05-13 15:24:41.451 ERROR (qtp1682092198-315202) [c:xxxxxx_2 s:shard3
r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.s.HttpSolrCall
null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request:[
http://XXXXXX127:8983/solr/xxxxxx_2_shard1_replica1,
http://XXXXXX132:8983/solr/xxxxxx_2_shard2_replica0]

2020-05-13 15:25:49.642 ERROR (qtp1682092198-315193) [c:xxxxxx_2 s:shard3
r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.s.HttpSolrCall
null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
timeout expired: 51815/50000 ms

and later until the JVM hangs
2020-05-13 15:58:54.397 ERROR (qtp1682092198-316314) [c:xxxxxx_2 s:shard3
r:core_node1 x:xxxxxx_2_shard3_replica0] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: no servers hosting shard:
xxxxxx_2_shard2

No OOM errors in the Solr logs, just the OOM killer script log:
Running OOM killer script for process 4488 for Solr on port 8983
Killed process 4488
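
To see how these errors are distributed over time, something like the
following could be run against the Solr log. This is only a sketch: it assumes
each entry sits on a single line in the layout shown above (timestamp, level,
thread, bracketed context, abbreviated logger name), and count_errors is a
made-up helper name, not part of Solr.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <LEVEL> (<thread>) [<context>] <logger> <message>"
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d{3}\s+"
    r"(?P<level>[A-Z]+)\s+\([^)]*\)\s+\[[^\]]*\]\s+(?P<logger>\S+)"
)


def count_errors(lines):
    """Count ERROR entries per (minute, logger); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") == "ERROR":
            counts[(match.group("minute"), match.group("logger"))] += 1
    return counts


if __name__ == "__main__":
    ctx = "[c:xxxxxx_2 s:shard3 r:core_node1 x:xxxxxx_2_shard3_replica0]"
    # Two of the entries quoted above, joined back onto single lines.
    sample = [
        f"2020-05-13 15:25:49.642 ERROR (qtp1682092198-315193) {ctx} "
        "o.a.s.s.HttpSolrCall null:java.io.IOException: "
        "java.util.concurrent.TimeoutException: Idle timeout expired: 51815/50000 ms",
        f"2020-05-13 15:58:54.397 ERROR (qtp1682092198-316314) {ctx} "
        "o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: "
        "no servers hosting shard: xxxxxx_2_shard2",
    ]
    for (minute, logger), n in sorted(count_errors(sample).items()):
        print(minute, logger, n)
```

Stack-trace continuation lines do not start with a timestamp, so they are
skipped rather than counted.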


--- For heap dump

I have a dump for one shard leader, taken just before the OOM script killed
the JVM but more than one hour after the problem started. I will take a look.

Regards.

Dominique

On Sun, May 17, 2020 at 20:22, Mikhail Khludnev <m...@apache.org> wrote:

> Hello, Dominique.
> What did it log? Which exception?
> Did you have a chance to review the heap dump? What consumed the whole heap?
>
> On Sun, May 17, 2020 at 11:05 AM Dominique Bejean <
> dominique.bej...@eolya.fr> wrote:
>
>> Hi,
>>
>> I have a six-node SolrCloud cluster whose six nodes suddenly all failed with
>> OOM at the same time.
>> This can happen even when the SolrCloud cluster is not under heavy load and
>> there is no indexing.
>>
>> I do not see any reason for this to happen. Here is the description of the
>> issue. Thank you for your suggestions and advice.
>>
>>
>> One or two hours before the nodes stop with OOM, we see this scenario on
>> all six nodes during the same five-minute time frame:
>> * a little bit more young GC: from one every second (duration < 0.05 s) to
>> one every two or three seconds (duration < 0.15 s)
>> * full GCs start occurring every 5 s with 0 bytes reclaimed
>> * young GCs start reclaiming fewer bytes
>> * long full GCs start reclaiming bytes, but fewer and fewer each time
>> * then no more young GC
>> Here are the GC graphs: https://www.eolya.fr/solr_issue_gc.png
>>
>>
>> Just before the problem occurs:
>> * there is no increase in requests per second
>> * no update/commit/merge
>> * CPU usage and load are low
>> * disk I/O is low
>> After the problem starts, requests take longer and longer, but there is
>> still no increase in CPU usage or disk I/O.
>>
>>
>> During the last issue, we dumped the threads on one node just before the OOM
>> but, unfortunately, more than one hour after the problem started.
>> 85% of the threads (more than 3000) are BLOCKED and related to log4j.
>> Solr is either trying to log slow queries or trying to log problems in the
>> request handler:
>> at org.apache.solr.common.SolrException.log(SolrException.java:148)
>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:204)
>>
>> This high count of BLOCKED threads is more a consequence than a cause. We
>> will dump the threads every minute until the next issue.
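
For the per-minute dumps, the share of BLOCKED threads can be tallied with a
small script. A minimal sketch, assuming jstack-style HotSpot dumps in which
each Java thread has a "java.lang.Thread.State:" line; thread_state_counts,
state_share and the sample excerpt are hypothetical, for illustration only.

```python
import re
from collections import Counter

# jstack prints one "java.lang.Thread.State: <STATE>" line per Java thread.
# Threads without such a line (JVM-internal threads) are not counted.
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+([A-Z_]+)")


def thread_state_counts(dump_text):
    """Return a Counter mapping each thread state to its number of threads."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = STATE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def state_share(counts, state="BLOCKED"):
    """Fraction of counted threads in the given state (0.0 for an empty dump)."""
    total = sum(counts.values())
    return counts[state] / total if total else 0.0


if __name__ == "__main__":
    # Made-up three-thread excerpt, only to show the expected input shape.
    sample = """\
"qtp-1" #11 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"qtp-2" #12 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"qtp-3" #13 prio=5
   java.lang.Thread.State: RUNNABLE
"""
    counts = thread_state_counts(sample)
    print(counts, round(state_share(counts), 2))
```

Running this over each dump in time order would show when the BLOCKED share
starts to climb relative to the first GC anomalies.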
>>
>>
>> About the Solr environment:
>> * Solr 6.6
>> * Java Oracle 1.8.0_112 25.112-b15
>>
>> * 1 collection with 10 million small documents
>> * 3 shards x 2 replicas
>> * 3.5 million docs per core
>> * 90 GB index size per core
>>
>> * Server with 6 processors and 90 GB of RAM
>> * Swappiness set to 1, nearly no swap used
>> * 4 GB heap, with usage between roughly 25% and 60% before young GC, and one
>> full GC (3 seconds) every 15 to 30 minutes when all is fine.
>>
>> * Default JVM settings with CMS GC
>> * JMX enabled
>> * Average requests per second at peak on one core: 170, but during the last
>> issue the average requests per second was 30!
>> * Average time per request: < 30 ms
>>
>> About updates:
>> * Very few adds/updates in general
>> * Some deleteByQuery (nearly 2000 per day), but none before the problem
>> occurs
>> * autocommit maxTime:15000ms
>>
>> About queries:
>> * Queries are standard queries or suggesters
>> * Queries generate facets, but there are no fields with a very high number
>> of unique values
>> * No grouping
>> * Heavy use of function queries for relevance computation
>>
>>
>> Thank you.
>>
>> Dominique
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
