The first thing is to stop using CMS and switch to G1GC. We've been running the settings below on over a hundred machines in prod for nearly four years.
SOLR_HEAP=8g
# Use G1 GC -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
  -XX:+UseG1GC \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=8m \
  -XX:MaxGCPauseMillis=200 \
  -XX:+UseLargePages \
  -XX:+AggressiveOpts \
"
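On the docValues question at the bottom of the thread: yes, that is the first schema change I would try. Without docValues, sorting, faceting, and grouping un-invert the fields onto the JVM heap at query time, which is typically where the collector footprint in your heap dump comes from; with docValues, that data is read from disk through the OS page cache instead. A rough sketch for schema.xml -- the field names and types here are placeholders for whatever you actually sort, facet, or group on, and a full reindex is required after the change:

  <field name="category" type="string"  indexed="true" stored="true" docValues="true"/>
  <field name="price"    type="tdouble" indexed="true" stored="true" docValues="true"/>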
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 7, 2020, at 2:39 AM, Karol Grzyb <grz...@gmail.com> wrote:
> 
> Hi Matthew, Erick!
> 
> Thank you very much for the feedback, I'll try to convince them to
> reduce the heap size.
> 
> Current GC settings:
> 
> -XX:+CMSParallelRemarkEnabled
> -XX:+CMSScavengeBeforeRemark
> -XX:+ParallelRefProcEnabled
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC
> -XX:CMSInitiatingOccupancyFraction=50
> -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:ConcGCThreads=4
> -XX:MaxTenuringThreshold=8
> -XX:NewRatio=3
> -XX:ParallelGCThreads=4
> -XX:PretenureSizeThreshold=64m
> -XX:SurvivorRatio=4
> -XX:TargetSurvivorRatio=90
> 
> Kind regards,
> Karol
> 
> 
> On Tue, Oct 6, 2020 at 4:52 PM Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> 12G is not that huge, it's surprising that you're seeing this problem.
>> 
>> However, there are a couple of things to look at:
>> 
>> 1> If you're saying that you have 16G total physical memory and are
>> allocating 12G to Solr, that's an anti-pattern. See:
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> If at all possible, you should allocate between 25% and 50% of your physical
>> memory to Solr...
>> 
>> 2> What garbage collector are you using? G1GC might be a better choice.
>> 
>>> On Oct 6, 2020, at 10:44 AM, matthew sporleder <msporle...@gmail.com> wrote:
>>> 
>>> Your index is so small that it should easily get cached into OS memory
>>> as it is accessed. Having a too-big heap is a known problem.
>>> 
>>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-HowmuchheapspacedoIneed?
>>> 
>>> On Tue, Oct 6, 2020 at 9:44 AM Karol Grzyb <grz...@gmail.com> wrote:
>>>> 
>>>> Hi Matthew,
>>>> 
>>>> Thank you for the answer. I cannot reproduce the setup locally, but I'll
>>>> try to convince them to reduce Xmx; I guess they won't agree to 1GB,
>>>> but certainly to something less than 12G.
>>>> And to set up a proper dev environment, because for now we can only test
>>>> on prod or stage, which are difficult to adjust.
>>>> 
>>>> Is being stuck in GC common behaviour under heavier load when the index
>>>> is small compared to the available heap? I was more worried about the
>>>> ratio of heap to total host memory.
>>>> 
>>>> Regards,
>>>> Karol
>>>> 
>>>> 
>>>> On Tue, Oct 6, 2020 at 2:39 PM matthew sporleder <msporle...@gmail.com> wrote:
>>>>> 
>>>>> You have a 12G heap for a 200MB index? Can you just try changing Xmx
>>>>> to, like, 1g?
>>>>> 
>>>>> On Tue, Oct 6, 2020 at 7:43 AM Karol Grzyb <grz...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I'm involved in investigating an issue of huge GC overhead during
>>>>>> performance tests on Solr nodes. The Solr version is 6.1. The last
>>>>>> tests were done on a staging env, and we ran into problems at
>>>>>> <100 requests/second.
>>>>>> 
>>>>>> The size of the index itself is ~200MB (~50K docs).
>>>>>> The index gets small updates every 15 min.
>>>>>> 
>>>>>> Queries involve sorting and faceting.
>>>>>> 
>>>>>> I've gathered some heap dumps; I can see from them that most of the
>>>>>> heap memory is retained by objects of the following classes:
>>>>>> 
>>>>>> - org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector (>4G, 91% of heap)
>>>>>> - org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
>>>>>> - org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
>>>>>> - org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector (>3.7G, 76% of heap)
>>>>>> 
>>>>>> Based on the information above, is there anything generic that can be
>>>>>> looked at as a source of potential improvement without diving deeply
>>>>>> into the schema and queries (which may be very difficult to change at
>>>>>> this moment)? I don't see docValues being enabled - could this help?
>>>>>> If I read the docs correctly, it's specifically helpful when there is
>>>>>> a lot of sorting/grouping/faceting.
>>>>>> 
>>>>>> Additionally, I see that many threads are blocked on LRUCache.get;
>>>>>> should I recommend switching to FastLRUCache?
>>>>>> 
>>>>>> Also, I wonder whether -Xmx12288m for the Java heap is too much for
>>>>>> 16G of memory? I see some (~5/s) page faults in Dynatrace during the
>>>>>> heaviest traffic.
>>>>>> 
>>>>>> Thank you very much for any help,
>>>>>> Kind regards,
>>>>>> Karol
>> 
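On the LRUCache.get contention above: I would shrink the heap and add docValues before touching the caches, but if the filter or query result cache really is the hot spot, FastLRUCache is a drop-in swap in solrconfig.xml. Its gets are cheaper at the cost of more expensive eviction, so it tends to help exactly when many threads block on reads. A sketch with made-up sizes -- tune them to your actual hit ratios:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>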