On 10/14/2019 7:18 AM, Vassil Velichkov (Sensika) wrote:
After the migration from 6.x to 7.6 we kept the default GC for a couple of weeks, then we started experimenting with G1 and managed to achieve less frequent OOM crashes, but not by much.
Changing your GC settings will never prevent OOMs. The only way to prevent them is to either increase the resource that's running out or reconfigure the program to use less of that resource.
As I explained in my previous e-mail, the unused filterCache entries are not discarded, even after a new SolrSearcher is started. The Replicas are synced with the Masters every 5 minutes, the filterCache is auto-warmed, and the JVM heap utilization keeps going up. Within 1 to 2 hours a 64GB heap is exhausted. The GC log entries clearly show more and more humongous allocations piling up.
While it is true that the generation-specific collectors for G1 do not clean up humongous allocations from garbage, eventually Java will perform a full GC, which will be slow, but should clean them up. If a full GC is not cleaning them up, that's a different problem, one that I would suspect lies in your installation. We have had memory leak bugs in Solr, but I am not aware of any that are as serious as your observations suggest.
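To put a rough number on why filterCache entries end up humongous: G1 treats any single allocation of at least half a region as humongous, and a filterCache entry is typically a bitset of roughly maxDoc/8 bytes. Here is a back-of-the-envelope sketch; the document count and region size below are illustrative assumptions, not values from your installation:

    public class HumongousCheck {
        public static void main(String[] args) {
            // Illustrative assumptions -- substitute the real values for your cores.
            long maxDoc = 200_000_000L;           // documents in one core (assumed)
            long regionSize = 16L * 1024 * 1024;  // e.g. -XX:G1HeapRegionSize=16m

            // A filterCache entry is roughly one bit per document in the core.
            long filterEntryBytes = maxDoc / 8;

            // G1 treats any single allocation of >= half a region as humongous.
            boolean humongous = filterEntryBytes >= regionSize / 2;

            System.out.printf("filterCache entry ~%d MB, humongous: %b%n",
                    filterEntryBytes / (1024 * 1024), humongous);
        }
    }

Raising -XX:G1HeapRegionSize (its maximum is 32m) can move some entries out of the humongous path, but it does not reduce the total heap the cache needs.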
You could be running into a memory leak ... but I really doubt that it is directly related to the filterCache or the humongous allocations. Upgrading to the latest release that you can would be advisable -- the latest 7.x version would be my first choice, or you could go all the way to 8.2.0.
Are you running completely stock Solr, or have you added custom code? One of the most common problems with custom code is leaking searcher objects, which will cause Java to retain the large cache entries. We have seen problems where one Solr version will work perfectly with custom code, but when Solr is upgraded, the custom code has memory leaks.
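To illustrate what a leaked searcher looks like in custom code, here is a minimal sketch; the component name is hypothetical, but SolrCore.getSearcher() with a RefCounted wrapper is how Solr hands out searcher references:

    import org.apache.solr.core.SolrCore;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.util.RefCounted;

    // Hypothetical custom component; the point is the reference handling.
    public class CustomSearchHelper {

        void inspectIndex(SolrCore core) {
            // getSearcher() increments a reference count on the current searcher.
            RefCounted<SolrIndexSearcher> ref = core.getSearcher();
            try {
                SolrIndexSearcher searcher = ref.get();
                System.out.println("maxDoc: " + searcher.getIndexReader().maxDoc());
            } finally {
                // The classic leak is skipping this decref(): a searcher that is
                // never released stays on the heap, along with every filterCache
                // entry it holds, even after a new searcher has been opened.
                ref.decref();
            }
        }
    }

A leak of that kind would match the symptom you describe, where old cache entries are never discarded after a new searcher opens.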
We have a really stressful use-case: a single user opens a live report with 20-30 widgets, each widget performs a Solr search or facet aggregation, sometimes with 5-15 complex filter queries attached to the main query, and the results are visualized as pivot charts. So one user can trigger hundreds of queries in a very short period of time, and when we have several analysts working on the same time period we usually end up with an OOM. This logic used to work quite well on Solr 6.x. The only other difference that comes to my mind is that with Solr 7.6 we've started using DocValues. I could not find documentation about DocValues memory consumption, so it might be related.
For cases where docValues are of major benefit, which is primarily facets and sorting, Solr will use less memory with docValues than it does with indexed terms. Adding docValues should not result in a dramatic increase in memory requirements, and in many cases, should actually require less memory.
Yep, but I plan to generate some detailed JVM trace-dumps, so we could analyze which class / data structure causes the OOM. Any recommendations about what tool to use for a detailed JVM dump?
Usually the stacktrace itself is not helpful in diagnosing OOMs -- because the place where the error is thrown can be ANY allocation, not necessarily the one that is the major resource hog.
What I'm interested in here is the message immediately after the OOME, not the stacktrace. I'll admit that is slightly odd, because for many problems I *am* interested in the stacktrace. OutOfMemoryError is one situation where the stacktrace is not very helpful, but the short message the error contains is useful. I only asked for the stacktrace because collecting it will usually mean that nothing else in the message has been modified.
Here are two separate examples of what I am looking for:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Caused by: java.lang.OutOfMemoryError: unable to create new native thread
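On your earlier question about tooling: for analyzing which class or data structure is eating the heap, what you want is a heap dump (an .hprof file) that you can open in something like Eclipse MAT, not a stacktrace. The usual route is to start the JVM with -XX:+HeapDumpOnOutOfMemoryError, but you can also trigger a dump programmatically; here is a minimal sketch assuming a HotSpot JVM (the output path is just an example):

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class HeapDumper {
        public static void main(String[] args) throws Exception {
            // Get the HotSpot diagnostic MXBean from the platform MBean server.
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);

            // Dump only live objects; the output path is just an example.
            bean.dumpHeap("/var/tmp/solr-heap.hprof", true);
        }
    }

However the dump is produced, opening the .hprof in Eclipse MAT and looking at the dominator tree is usually the fastest way to see what is holding the heap.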
Also, not sure if I could send attachments to the mailing list, but there must be a way to share logs...?
There are many websites that facilitate file sharing. One example, and the one that I use most frequently, is Dropbox. Sending attachments to the list rarely works.
Thanks, Shawn