I'm going to share how I debugged a similar OOM crash; solving it had nothing to do with increasing the heap.

https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html

This is specifically about Apache Ranger and how to fix it there, but you can treat it just like any application using Solr. There were a few things that caused issues "out of the blue":

- Document TTL - documents getting deleted after some time would trigger OOM (due to caches taking up too much heap)
- Extra query load - again, caches taking up too much memory
- Extra inserts - too many commits refreshing caches and, again, going OOM

Many of these can be reduced by using docValues for fields that you typically sort/filter on.

Kevin Risden
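
To make the docValues suggestion concrete, here is a minimal schema.xml sketch; the field names are hypothetical and an existing field needs a full reindex after the change:

   <!-- schema.xml sketch: hypothetical fields, not from this thread -->
   <field name="created_date" type="tdate"  indexed="true" stored="false" docValues="true"/>
   <field name="category"     type="string" indexed="true" stored="false" docValues="true"/>

Sorting and faceting on such fields then read the on-disk docValues structures instead of un-inverting the field into the Java heap (the FieldCache), which is one common source of the cache pressure described above.
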
On Wed, Apr 11, 2018 at 6:01 PM, Deepak Goel <deic...@gmail.com> wrote:

> A few observations:
>
> 1. The Old Gen heap on 9th April is about 6 GB occupied, which then runs up
>    to 9+ GB on 10th April (it steadily increases throughout the day).
> 2. The Old Gen GC is never able to reclaim any free memory.
>
> Deepak
> "Please stop cruelty to Animals, help by becoming a Vegan"
> +91 73500 12833
> deic...@gmail.com
>
> Facebook: https://www.facebook.com/deicool
> LinkedIn: www.linkedin.com/in/deicool
>
> "Plant a Tree, Go Green"
>
> On Wed, Apr 11, 2018 at 8:53 PM, Adam Harrison-Fuller <aharrison-ful...@mintel.com> wrote:
>
> > In addition, here is the GC log leading up to the crash.
> >
> > https://www.dropbox.com/s/sq09d6hbss9b5ov/solr_gc_log_20180410_1009.zip?dl=0
> >
> > Thanks!
> >
> > Adam
> >
> > On 11 April 2018 at 16:18, Adam Harrison-Fuller <aharrison-ful...@mintel.com> wrote:
> >
> > > Thanks for the advice so far.
> > >
> > > The directoryFactory is set to ${solr.directoryFactory:solr.NRTCachingDirectoryFactory}.
> > >
> > > The servers' workload is predominantly queries, with updates taking place
> > > once a day. The servers seem more likely to go down while they are indexing,
> > > but not exclusively so.
> > >
> > > I'm having issues locating the actual out-of-memory exception. I can tell
> > > that it has run out of memory, as it has called the oom_killer script, which
> > > has left a log file in the logs directory. I cannot find the actual exception
> > > in the solr.log or our solr_gc.log, any suggestions?
> > >
> > > Cheers,
> > > Adam
> > >
> > > On 11 April 2018 at 15:49, Walter Underwood <wun...@wunderwood.org> wrote:
> > >
> > > > For readability, I'd use -Xmx12G instead of -XX:MaxHeapSize=12884901888.
> > > > Also, I always use a start size the same as the max size, since servers
> > > > will eventually grow to the max size. So:
> > > >
> > > > -Xmx12G -Xms12G
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > > > http://observer.wunderwood.org/ (my blog)
> > > >
> > > > On Apr 11, 2018, at 6:29 AM, Sujay Bawaskar <sujaybawas...@gmail.com> wrote:
> > > >
> > > > > Which directoryFactory is defined in solrconfig.xml? Your JVM heap should
> > > > > be tuned with respect to that.
> > > > > How is Solr being used: is it more updates and fewer queries, or fewer
> > > > > updates and more queries?
> > > > > What is the OOM error? Is it frequent GC, or error 12?
> > > > >
> > > > > On Wed, Apr 11, 2018 at 6:05 PM, Adam Harrison-Fuller <aharrison-ful...@mintel.com> wrote:
> > > > >
> > > > > > Hey Jesus,
> > > > > >
> > > > > > Thanks for the suggestions. The Solr nodes have 4 CPUs assigned to them.
> > > > > >
> > > > > > Cheers!
> > > > > > Adam
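
On the point a few messages up about not being able to find the actual exception: the OutOfMemoryError stack trace may only appear in the JVM's stdout (the console log in Solr's logs directory) rather than in solr.log. A heap dump taken at the moment of the OOM also makes post-mortem analysis much easier; a sketch, reusing the log directory that appears in the flags further down the thread:

   -XX:+HeapDumpOnOutOfMemoryError
   -XX:HeapDumpPath=/data/gnpd/solr/logs   # needs roughly heap-sized free disk space

These can sit alongside the existing -XX:OnOutOfMemoryError hook.
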
> > > > > > On 11 April 2018 at 11:22, Jesus Olivan <jesus.oli...@letgo.com> wrote:
> > > > > >
> > > > > > > Hi Adam,
> > > > > > >
> > > > > > > IMHO you could try increasing the heap to 20 GB (with 46 GB of physical
> > > > > > > RAM, your JVM can afford more heap without starving the memory needed
> > > > > > > outside the heap).
> > > > > > >
> > > > > > > Another good change would be to raise -XX:CMSInitiatingOccupancyFraction
> > > > > > > from 50 to 75. I think the CMS collector works better when the old
> > > > > > > generation is more populated.
> > > > > > >
> > > > > > > I usually set the survivor spaces to a smaller size. If you try a
> > > > > > > SurvivorRatio of 6, I think performance would improve.
> > > > > > >
> > > > > > > Another good practice, in my experience, is to set a static NewSize
> > > > > > > instead of -XX:NewRatio=3. You could try -XX:NewSize=7000m and
> > > > > > > -XX:MaxNewSize=7000m (one third of the total heap space is recommended).
> > > > > > >
> > > > > > > Finally, my best results after a lot of JVM R&D related to Solr came from
> > > > > > > removing the CMSScavengeBeforeRemark flag and adding ParGCCardsPerStrideChunk.
> > > > > > >
> > > > > > > It would also be good to set ParallelGCThreads and ConcGCThreads to their
> > > > > > > optimal values, but we need your system's CPU count for that. Can you
> > > > > > > provide this data, please?
> > > > > > >
> > > > > > > Regards
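
Pulled together as a single illustrative sketch (the values are simply the ones suggested above and are untested against this index; the stride-chunk value is a common choice, not from the thread):

   -Xms20g -Xmx20g                           # larger heap, start size equal to max size
   -XX:CMSInitiatingOccupancyFraction=75     # up from 50
   -XX:SurvivorRatio=6                       # smaller survivor spaces
   -XX:NewSize=7000m -XX:MaxNewSize=7000m    # static young generation (~1/3 of heap) instead of -XX:NewRatio=3
   -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=4096   # diagnostic option in JDK 8
   # and drop -XX:+CMSScavengeBeforeRemark
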
> > > > > > > 2018-04-11 12:01 GMT+02:00 Adam Harrison-Fuller <aharrison-ful...@mintel.com>:
> > > > > > >
> > > > > > > > Hey all,
> > > > > > > >
> > > > > > > > I was wondering if I could get some JVM/GC tuning advice to resolve an
> > > > > > > > issue that we are experiencing.
> > > > > > > >
> > > > > > > > Full disclaimer, I am in no way a JVM/Solr expert, so any advice you can
> > > > > > > > render would be greatly appreciated.
> > > > > > > >
> > > > > > > > Our SolrCloud nodes are having issues, throwing OOM exceptions under load.
> > > > > > > > This issue has only started manifesting itself over the last few months,
> > > > > > > > during which time the only change I can discern is an increase in index
> > > > > > > > size. They are running Solr 5.5.2 on OpenJDK version "1.8.0_101". The
> > > > > > > > index is currently 58G and the server has 46G of physical RAM and runs
> > > > > > > > nothing other than the Solr node.
> > > > > > > >
> > > > > > > > The JVM is invoked with the following JVM options:
> > > > > > > > -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
> > > > > > > > -XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark
> > > > > > > > -XX:ConcGCThreads=4 -XX:InitialHeapSize=12884901888 -XX:+ManagementServer
> > > > > > > > -XX:MaxHeapSize=12884901888 -XX:MaxTenuringThreshold=8
> > > > > > > > -XX:NewRatio=3 -XX:OldPLABSize=16
> > > > > > > > -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 30000 /data/gnpd/solr/logs
> > > > > > > > -XX:ParallelGCThreads=4
> > > > > > > > -XX:+ParallelRefProcEnabled -XX:PretenureSizeThreshold=67108864
> > > > > > > > -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps
> > > > > > > > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
> > > > > > > > -XX:+PrintTenuringDistribution -XX:SurvivorRatio=4
> > > > > > > > -XX:TargetSurvivorRatio=90
> > > > > > > > -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers
> > > > > > > > -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> > > > > > > >
> > > > > > > > These values were decided upon several years ago by a colleague, based
> > > > > > > > upon some suggestions from this mailing group, when the index was ~25G.
> > > > > > > >
> > > > > > > > I have imported the GC logs into GCViewer and attached a link to a
> > > > > > > > screenshot showing the lead-up to an OOM crash. Interestingly, the young
> > > > > > > > generation space is almost empty before the repeated GCs and subsequent
> > > > > > > > crash.
> > > > > > > > https://imgur.com/a/Wtlez
> > > > > > > >
> > > > > > > > I was considering slowly increasing the amount of heap available to the
> > > > > > > > JVM until the crashes stop; any other suggestions? I'm looking to get the
> > > > > > > > nodes stable without the GC taking forever to run.
> > > > > > > >
> > > > > > > > Additional information can be provided on request.
> > > > > > > >
> > > > > > > > Cheers!
> > > > > > > > Adam
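
As an aside, 12884901888 bytes is exactly 12 GiB (12 x 1024^3), so the two heap-size options in the list above are just a longer spelling of the form Walter suggests earlier in the thread:

   -Xms12g -Xmx12g   # equivalent to -XX:InitialHeapSize=12884901888 -XX:MaxHeapSize=12884901888
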
> > > > > --
> > > > > Thanks,
> > > > > Sujay P Bawaskar
> > > > > M:+91-77091 53669