Hi Shawn,

I've reduced the JVM heap on one of the shards to 20GB and then simulated some
heavy load to reproduce the issue more quickly.
The solr.log ROOT logger was set to TRACE level, but I can't see anything
meaningful there: solr.log ends at 07:31:40.352 GMT, while the GC log shows
later entries, including a "Pause Full (Allocation Failure)".
BTW, before this test I had never seen any automatic Full GC attempts in the
GC logs. I can't see any OOME messages in any of the logs, only in the separate
solr_oom_killer log, and that is just the log of the killer script.

Also, to answer your previous questions:
        1. We run completely stock Solr: no custom code, no plugins.
Regardless, we never had such OOMs with Solr 4.x or Solr 6.x.
        2. It seems that a Full GC is rarely, if ever, triggered. In the cases
in the past where I have seen log entries for Full GC attempts, the JVM crashed
with OOM long before the Full GC could accomplish anything.
        3. On a side note: when a Solr query spans multiple shards (our
sharding is by timePublished), the HTTP connections from the aggregation node
to the other shards frequently time out at 60 seconds, even though the Solr
HTTP client request timeout is set dynamically to a much higher value (120-1200
seconds), and even though we've increased the timeout values in solr.xml for
the shardHandlerFactory (socketTimeout / connTimeout) to 1200 seconds. When we
hit such inter-cluster aggregation timeouts, the end-users get "Error
retrieving data" and usually refresh the App, basically re-running the heavy
Solr queries over and over again. I included a sample from the application
logs below, and a small illustrative sketch of the client-side knobs right
after this list. This usage pattern might also make things worse - I don't
know what happens within the shards when the aggregation fails due to
timed-out inter-shard connections. If all the shards keep executing the
queries passed from the aggregation node, they just waste resources, and every
subsequent re-run only increases the resource consumption.
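
>>> Illustrative SolrJ sketch (client timeouts + timeAllowed)
Just a rough sketch, assuming a SolrJ client - our application layer is not
SolrJ, the query below is made up, and the 120s timeAllowed value is only an
example; the node/core names are taken from the app log further down.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Client-side HTTP timeouts (in ms). These only govern the request to the
        // aggregation node; the aggregator's own requests to the other shards are
        // governed by shardHandlerFactory (socketTimeout / connTimeout) in solr.xml.
        HttpSolrClient client = new HttpSolrClient.Builder("http://172.18.1.191:8983/solr/post")
                .withConnectionTimeout(60_000)
                .withSocketTimeout(1_200_000)
                .build();

        SolrQuery q = new SolrQuery("timePublished:[NOW-7DAYS TO NOW]");  // made-up query
        q.setParam("shards",
                "172.18.1.191:8983/solr/post,172.18.1.206:8983/solr/post,172.18.1.198:8983/solr/post");
        // timeAllowed asks Solr to stop searching after N ms and return partial
        // results, so abandoned/re-run queries don't keep burning CPU on every shard.
        q.setTimeAllowed(120_000);

        QueryResponse rsp = client.query(q);
        System.out.println("numFound=" + rsp.getResults().getNumFound());
        client.close();
    }
}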

>>> SOLR.LOG (last 1 minute)
2019-10-15 07:31:12.848 DEBUG (Connection evictor) [   ] 
o.a.s.u.s.InstrumentedPoolingHttpClientConnectionManager Closing expired 
connections
2019-10-15 07:31:12.848 DEBUG (Connection evictor) [   ] 
o.a.s.u.s.InstrumentedPoolingHttpClientConnectionManager Closing connections 
idle longer than 50000 MILLISECONDS
2019-10-15 07:31:12.848 DEBUG (Connection evictor) [   ] 
o.a.s.u.s.InstrumentedPoolingHttpClientConnectionManager Closing expired 
connections
2019-10-15 07:31:40.352 DEBUG (Connection evictor) [   ] 
o.a.s.u.s.InstrumentedPoolingHttpClientConnectionManager Closing expired 
connections
2019-10-15 07:31:40.352 DEBUG (Connection evictor) [   ] 
o.a.s.u.s.InstrumentedPoolingHttpClientConnectionManager Closing connections 
idle longer than 50000 MILLISECONDS

>>> SOLR_GC.LOG (last 1 minute)
[2019-10-15T10:32:07.509+0300][528.164s] GC(64) Pause Full (Allocation Failure)
[2019-10-15T10:32:07.539+0300][528.193s] GC(64) Phase 1: Mark live objects
[2019-10-15T10:32:16.785+0300][537.440s] GC(64) Cleaned string and symbol 
table, strings: 23724 processed, 0 removed, symbols: 149625 processed, 0 removed
[2019-10-15T10:32:16.785+0300][537.440s] GC(64) Phase 1: Mark live objects 
9246.644ms
[2019-10-15T10:32:16.785+0300][537.440s] GC(64) Phase 2: Compute new object 
addresses
[2019-10-15T10:32:23.065+0300][543.720s] GC(64) Phase 2: Compute new object 
addresses 6279.790ms
[2019-10-15T10:32:23.065+0300][543.720s] GC(64) Phase 3: Adjust pointers
[2019-10-15T10:32:28.905+0300][549.560s] GC(64) Phase 3: Adjust pointers 
5839.647ms
[2019-10-15T10:32:28.905+0300][549.560s] GC(64) Phase 4: Move objects
[2019-10-15T10:32:28.905+0300][549.560s] GC(64) Phase 4: Move objects 0.135ms
[2019-10-15T10:32:28.921+0300][549.576s] GC(64) Using 8 workers of 8 to rebuild 
remembered set
[2019-10-15T10:32:34.763+0300][555.418s] GC(64) Eden regions: 0->0(160)
[2019-10-15T10:32:34.763+0300][555.418s] GC(64) Survivor regions: 0->0(40)
[2019-10-15T10:32:34.763+0300][555.418s] GC(64) Old regions: 565->565
[2019-10-15T10:32:34.763+0300][555.418s] GC(64) Humongous regions: 75->75
[2019-10-15T10:32:34.763+0300][555.418s] GC(64) Metaspace: 
52093K->52093K(1097728K)
[2019-10-15T10:32:34.764+0300][555.418s] GC(64) Pause Full (Allocation Failure) 
20383M->20383M(20480M) 27254.340ms
[2019-10-15T10:32:34.764+0300][555.419s] GC(64) User=56.35s Sys=0.03s 
Real=27.26s

>>> solr_oom_killer-8983-2019-10-15_07_32_34.log
Running OOM killer script for process 953 for Solr on port 8983
Killed process 953

>>> JVM GC Settings
-Duser.timezone=UTC
-XX:+ParallelRefProcEnabled
-XX:+UseG1GC
-XX:+UseLargePages
-XX:ConcGCThreads=8
-XX:G1HeapRegionSize=32m
-XX:NewRatio=3
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/log/solr
-XX:ParallelGCThreads=8
-XX:SurvivorRatio=4
-Xlog:gc*:file=/var/log/solr/solr_gc.log:time,uptime:filecount=9,filesize=20M
-Xms20480M
-Xmx20480M
-Xss256k
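
>>> Back-of-the-envelope: filterCache entries vs. G1 regions
This ties back to the humongous allocations I mentioned in my previous e-mail.
Purely a sketch - the 200M docs-per-shard figure is hypothetical - but it shows
why every non-sparse filterCache bitset becomes a humongous object with
G1HeapRegionSize=32m (G1 treats any object of at least half a region as
humongous):

public class HumongousEstimate {
    public static void main(String[] args) {
        long maxDoc = 200_000_000L;                // hypothetical docs per shard
        long regionBytes = 32L * 1024 * 1024;      // -XX:G1HeapRegionSize=32m
        long humongousThreshold = regionBytes / 2; // object >= half a region => humongous

        // A non-sparse filterCache entry is essentially a bitset, one bit per document.
        long filterEntryBytes = maxDoc / 8;

        System.out.printf("filter entry ~%d MiB, threshold %d MiB, humongous=%b%n",
                filterEntryBytes >> 20, humongousThreshold >> 20,
                filterEntryBytes >= humongousThreshold);
        // ~23 MiB per cached filter vs. a 16 MiB threshold -> every such entry is humongous.
    }
}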

>>> App Log Sample Entries
[2019-10-15 00:01:06] PRD-01-WEB-04.ERROR [0.000580]: search -> SOLR Exception 
[1/3]: queryId=[85e6a7d6f2a9f7e08fc438c862e4df9b], responseHttpCode=[500], 
solrResponseCode=[500], 
solrMessage=[org.apache.solr.client.solrj.SolrServerException: Timeout occured 
while waiting response from server at: http://172.18.1.206:8983/solr/post], 
SOLR Shards: Array
(
    [0] => 172.18.1.191:8983/solr/post
    [1] => 172.18.1.206:8983/solr/post
    [2] => 172.18.1.198:8983/solr/post
)


Cheers,
Vassil

-----Original Message-----
From: Shawn Heisey <apa...@elyograg.org> 
Sent: Monday, October 14, 2019 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any 
tests with Java 13 and the new ZGC?

On 10/14/2019 7:18 AM, Vassil Velichkov (Sensika) wrote:
>>>>>> After the migration from 6.x to 7.6 we kept the default GC for a couple 
>>>>>> of weeks, than we've started experimenting with G1 and we've managed to 
>>>>>> achieve less frequent OOM crashes, but not by much.

Changing your GC settings will never prevent OOMs.  The only way to prevent 
them is to either increase the resource that's running out or reconfigure the 
program to use less of that resource.

>>>>>> As I explained in my previous e-mail, the unused filterCache entries are 
>>>>>> not discarded, even after a new SolrSearcher is started. The Replicas 
>>>>>> are synced with the Masters every 5 minutes, the filterCache is 
>>>>>> auto-warmed and the JVM heap utilization keeps going up. Within 1 to 2 
>>>>>> hours a 64GB heap is being exhausted. The GC log entries clearly show 
>>>>>> that there are more and more humongous allocations piling up.

While it is true that the generation-specific collectors for G1 do not clean up 
humongous allocations even when they are garbage, eventually Java will perform 
a full GC, which will be slow, but should clean them up.  If a full GC is not 
cleaning them up, that's a different problem, and one that I would suspect is 
actually a problem with your installation.  We have had memory leak bugs in 
Solr, but I am not aware of any that are as serious as your observations 
suggest.

You could be running into a memory leak ... but I really doubt that it is 
directly related to the filterCache or the humongous allocations. 
Upgrading to the latest release that you can would be advisable -- the latest 
7.x version would be my first choice, or you could go all the way to 8.2.0.

Are you running completely stock Solr, or have you added custom code? 
One of the most common problems with custom code is leaking searcher objects, 
which will cause Java to retain the large cache entries.  We have seen problems 
where one Solr version will work perfectly with custom code, but when Solr is 
upgraded, the custom code has memory leaks.
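
Here is a very rough sketch (a hypothetical custom component, not anything that 
ships with Solr) of the pattern that leaking code usually gets wrong -- the 
searcher reference must be released in a finally block, otherwise the old 
searcher and every cache entry tied to it stay reachable:

import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

public class SearcherUsageSketch {
    public static int countDocs(SolrCore core) {
        RefCounted<SolrIndexSearcher> ref = core.getSearcher(); // increments the ref count
        try {
            SolrIndexSearcher searcher = ref.get();
            return searcher.maxDoc();
        } finally {
            ref.decref(); // forgetting this is the classic searcher leak
        }
    }
}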

>>>>>>> We have a really stressful use-case: a single user opens a live-report 
>>>>>>> with 20-30 widgets, each widget performs a Solr Search or facet 
>>>>>>> aggregations, sometimes with 5-15 complex filter queries attached to 
>>>>>>> the main query, so the end results are visualized as pivot charts. So, 
>>>>>>> one user could trigger hundreds of queries in a very short period of 
>>>>>>> time and when we have several analysts working on the same time-period, 
>>>>>>> we usually end-up with OOM. This logic used to work quite well on Solr 
>>>>>>> 6.x. The only other difference that comes to my mind is that with Solr 
>>>>>>> 7.6 we've started using DocValues. I could not find documentation about 
>>>>>>> DocValues memory consumption, so it might be related.

For cases where docValues are of major benefit, which is primarily facets and 
sorting, Solr will use less memory with docValues than it does with indexed 
terms.  Adding docValues should not result in a dramatic increase in memory 
requirements, and in many cases, should actually require less memory.

>>>>>>> Yep, but I plan to generate some detailed JVM trace-dumps, so we could 
>>>>>>> analyze which class / data structure causes the OOM. Any 
>>>>>>> recommendations about what tool to use for a detailed JVM dump?

Usually the stacktrace itself is not helpful in diagnosing OOMs -- because the 
place where the error is thrown can be ANY allocation, not necessarily the one 
that is the major resource hog.

What I'm interested in here is the message immediately after the OOME, not the 
stacktrace.  Which I'll admit is slightly odd, because for many problems I *am* 
interested in the stacktrace.  OutOfMemoryError is one situation where the 
stacktrace is not very helpful, but the short message the error contains is 
useful.  I only asked for the stacktrace because collecting it will usually 
mean that nothing else in the message has been modified.

Here are two separate examples of what I am looking for:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Caused by: java.lang.OutOfMemoryError: unable to create new native thread

> Also, not sure if I could send attachments to the mailing list, but there 
> must be a way to share logs...?

There are many websites that facilitate file sharing.  One example, and the one 
that I use most frequently, is dropbox.  Sending attachments to the list rarely 
works.

Thanks,
Shawn
