I'd like to share how I debugged a similar OOM crash; solving it had
nothing to do with increasing the heap.

https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html

It is written specifically about Apache Ranger and how to fix the problem
there, but the same approach applies to any application using Solr.

There were a few things that caused issues "out of the blue":

   - Document TTL
      - documents being deleted after their TTL expired would trigger OOM
      (caches taking up too much heap)
   - Extra query load
      - again, caches taking up too much memory
   - Extra inserts
      - too many commits refreshing caches, again leading to OOM

Many of these issues can be reduced by using docValues for fields that you
typically sort or filter on.
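
For example, a field that is mostly sorted or filtered on can be given
docValues in schema.xml (the field and type names here are placeholders,
not taken from any real setup):

   <field name="created_at" type="tdate" indexed="true" stored="false"
          docValues="true"/>

Note that enabling docValues on an existing field requires a full reindex.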

Kevin Risden

On Wed, Apr 11, 2018 at 6:01 PM, Deepak Goel <deic...@gmail.com> wrote:

> A few observations:
>
> 1. The Old Gen heap on 9th April is about 6GB occupied, which then climbs
> to 9+GB on 10th April (it steadily increases throughout the day).
> 2. The Old Gen GC is never able to reclaim any free memory.
>
>
>
> Deepak
> "Please stop cruelty to Animals, help by becoming a Vegan"
> +91 73500 12833
> deic...@gmail.com
>
> Facebook: https://www.facebook.com/deicool
> LinkedIn: www.linkedin.com/in/deicool
>
> "Plant a Tree, Go Green"
>
> On Wed, Apr 11, 2018 at 8:53 PM, Adam Harrison-Fuller <
> aharrison-ful...@mintel.com> wrote:
>
> > In addition, here is the GC log leading up to the crash.
> >
> > https://www.dropbox.com/s/sq09d6hbss9b5ov/solr_gc_log_20180410_1009.zip?dl=0
> >
> > Thanks!
> >
> > Adam
> >
> > On 11 April 2018 at 16:18, Adam Harrison-Fuller <
> > aharrison-ful...@mintel.com
> > > wrote:
> >
> > > Thanks for the advice so far.
> > >
> > > The directoryFactory is set to
> > > ${solr.directoryFactory:solr.NRTCachingDirectoryFactory}.
> > >
> > > The servers' workload is predominantly queries, with updates taking place
> > > once a day.  The servers seem more likely to go down while they are
> > > indexing, but not exclusively so.
> > >
> > > I'm having issues locating the actual out-of-memory exception.  I can
> > > tell that it has run out of memory because it called the oom_killer
> > > script, which has left a log file in the logs directory.  I cannot find
> > > the actual exception in solr.log or our solr_gc.log; any suggestions?
> > >
> > > Cheers,
> > > Adam
> > >
> > >
> > > On 11 April 2018 at 15:49, Walter Underwood <wun...@wunderwood.org>
> > wrote:
> > >
> > >> For readability, I'd use -Xmx12G instead of -XX:MaxHeapSize=12884901888.
> > >> Also, I always use a start size the same as the max size, since servers
> > >> will eventually grow to the max size. So:
> > >>
> > >> -Xmx12G -Xms12G
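> > >>
> > >> If Solr is started with the stock bin/solr scripts, one place to put
> > >> that (a sketch; the variable names assume the standard solr.in.sh) is:
> > >>
> > >> SOLR_JAVA_MEM="-Xms12g -Xmx12g"
> > >>
> > >> or, equivalently in recent 5.x releases, SOLR_HEAP="12g".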
> > >>
> > >> wunder
> > >> Walter Underwood
> > >> wun...@wunderwood.org
> > >> http://observer.wunderwood.org/  (my blog)
> > >>
> > >> > On Apr 11, 2018, at 6:29 AM, Sujay Bawaskar <
> sujaybawas...@gmail.com>
> > >> wrote:
> > >> >
> > >> > Which directory factory is defined in solrconfig.xml? Your JVM heap
> > >> > should be tuned with respect to that.
> > >> > How is Solr being used: is it more updates and fewer queries, or fewer
> > >> > updates and more queries?
> > >> > What is the OOM error? Is it frequent GC, or error 12?
> > >> >
> > >> > On Wed, Apr 11, 2018 at 6:05 PM, Adam Harrison-Fuller <
> > >> > aharrison-ful...@mintel.com> wrote:
> > >> >
> > >> >> Hey Jesus,
> > >> >>
> > >> >> Thanks for the suggestions.  The Solr nodes have 4 CPUs assigned to
> > >> them.
> > >> >>
> > >> >> Cheers!
> > >> >> Adam
> > >> >>
> > >> >> On 11 April 2018 at 11:22, Jesus Olivan <jesus.oli...@letgo.com>
> > >> wrote:
> > >> >>
> > >> >>> Hi Adam,
> > >> >>>
> > >> >>> IMHO you could try increasing the heap to 20 GB (with 46 GB of
> > >> >>> physical RAM, your JVM can afford more heap without penalties caused
> > >> >>> by a lack of off-heap RAM).
> > >> >>>
> > >> >>> Another good change would be to increase
> > >> >>> -XX:CMSInitiatingOccupancyFraction from 50 to 75. I think the CMS
> > >> >>> collector works better when the old generation space is more
> > >> >>> populated.
> > >> >>>
> > >> >>> I usually set the survivor spaces to a smaller size. If you try
> > >> >>> SurvivorRatio=6, I think performance would be improved.
> > >> >>>
> > >> >>> Another good practice, in my experience, is to set a static NewSize
> > >> >>> instead of -XX:NewRatio=3. You could try -XX:NewSize=7000m and
> > >> >>> -XX:MaxNewSize=7000m (one third of the total heap space is
> > >> >>> recommended).
> > >> >>>
> > >> >>> Finally, my best results after deep JVM R&D work related to Solr came
> > >> >>> from removing the CMSScavengeBeforeRemark flag and applying a new one:
> > >> >>> ParGCCardsPerStrideChunk.
> > >> >>>
> > >> >>> It would also be good to set ParallelGCThreads and ConcGCThreads to
> > >> >>> their optimal values, and we need your system's CPU count to work
> > >> >>> those out. Can you provide this data, please?
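> > >> >>>
> > >> >>> As a rough sketch (not tested against your workload), these settings
> > >> >>> could be collected in solr.in.sh via the GC_TUNE variable, assuming
> > >> >>> Solr is started with the stock bin/solr scripts:
> > >> >>>
> > >> >>> GC_TUNE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
> > >> >>>   -XX:CMSInitiatingOccupancyFraction=75 \
> > >> >>>   -XX:+UseCMSInitiatingOccupancyOnly \
> > >> >>>   -XX:NewSize=7000m -XX:MaxNewSize=7000m \
> > >> >>>   -XX:SurvivorRatio=6 \
> > >> >>>   -XX:ParGCCardsPerStrideChunk=4096"
> > >> >>>
> > >> >>> ParGCCardsPerStrideChunk takes a numeric value (4096 is a commonly
> > >> >>> cited starting point, not a measured optimum) and may need
> > >> >>> -XX:+UnlockDiagnosticVMOptions depending on the JDK build.
> > >> >>> ParallelGCThreads and ConcGCThreads are left out until the CPU count
> > >> >>> is known.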
> > >> >>>
> > >> >>> Regards
> > >> >>>
> > >> >>>
> > >> >>> 2018-04-11 12:01 GMT+02:00 Adam Harrison-Fuller <
> > >> >>> aharrison-ful...@mintel.com
> > >> >>>> :
> > >> >>>
> > >> >>>> Hey all,
> > >> >>>>
> > >> >>>> I was wondering if I could get some JVM/GC tuning advice to
> resolve
> > >> an
> > >> >>>> issue that we are experiencing.
> > >> >>>>
> > >> >>>> Full disclaimer, I am in no way a JVM/Solr expert so any advice
> you
> > >> can
> > >> >>>> render would be greatly appreciated.
> > >> >>>>
> > >> >>>> Our SolrCloud nodes are having issues with OOM exceptions under
> > >> >>>> load.  This issue has only started manifesting over the last few
> > >> >>>> months, during which time the only change I can discern is an
> > >> >>>> increase in index size.  They are running Solr 5.5.2 on OpenJDK
> > >> >>>> version "1.8.0_101".  The index is currently 58G and the server has
> > >> >>>> 46G of physical RAM and runs nothing other than the Solr node.
> > >> >>>>
> > >> >>>> The JVM is invoked with the following JVM options:
> > >> >>>> -XX:CMSInitiatingOccupancyFraction=50
> > >> >>>> -XX:CMSMaxAbortablePrecleanTime=6000
> > >> >>>> -XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark
> > >> >>>> -XX:ConcGCThreads=4 -XX:InitialHeapSize=12884901888
> > >> >>>> -XX:+ManagementServer -XX:MaxHeapSize=12884901888
> > >> >>>> -XX:MaxTenuringThreshold=8 -XX:NewRatio=3 -XX:OldPLABSize=16
> > >> >>>> -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 30000
> > >> >>>> /data/gnpd/solr/logs
> > >> >>>> -XX:ParallelGCThreads=4 -XX:+ParallelRefProcEnabled
> > >> >>>> -XX:PretenureSizeThreshold=67108864
> > >> >>>> -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps
> > >> >>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
> > >> >>>> -XX:+PrintTenuringDistribution -XX:SurvivorRatio=4
> > >> >>>> -XX:TargetSurvivorRatio=90
> > >> >>>> -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers
> > >> >>>> -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> > >> >>>>
> > >> >>>> These values were decided upon several years ago by a colleague,
> > >> >>>> based upon suggestions from this mailing list, when the index size
> > >> >>>> was ~25G.
> > >> >>>>
> > >> >>>> I have imported the GC logs into GCViewer and attached a link to a
> > >> >>>> screenshot showing the lead-up to an OOM crash.  Interestingly, the
> > >> >>>> young generation space is almost empty before the repeated GCs and
> > >> >>>> the subsequent crash.
> > >> >>>> https://imgur.com/a/Wtlez
> > >> >>>>
> > >> >>>> I was considering gradually increasing the amount of heap available
> > >> >>>> to the JVM until the crashes stop; any other suggestions?  I'm trying
> > >> >>>> to get the nodes stable without having the GC take forever to run.
> > >> >>>>
> > >> >>>> Additional information can be provided on request.
> > >> >>>>
> > >> >>>> Cheers!
> > >> >>>> Adam
> > >> >>>>
> > >> >>>
> > >> >>
> > >> >
> > >> >
> > >> > --
> > >> > Thanks,
> > >> > Sujay P Bawaskar
> > >> > M:+91-77091 53669
> > >>
> > >>
> > >
> >
> >
>
