GCViewer will work on the GC logging file created by Solr. It has some nice summaries, particularly of stop-the-world GC events.
From there, you can pinpoint the exact times from manual inspection of the GC log. It usually looks like this: Your replica went into recovery when again? sometime between 14:00 and 15:00. GCViewer: Oh, look. Sometime between 14:30 and 14:35 there was a 20 second stop-the-world GC pause. Ok, back to the Solr log and I see “leader initiated recovery” sent by the leader to this replica at 14:32. Oh look, the GC log shows the stop-the-world GC pause starting at 14:31:45. Let's tune GC. That’s if you want to pin it exactly, but usually it looks more like this: Oh, look. GCViewer shows long stop-the-world GC pauses about the time your replica went into recovery, let’s start tuning GC. There are other products that give you a good way to navigate the GC events, GCViewer is free though. Best, Erick > On Mar 18, 2019, at 10:17 AM, Jeff Courtade <courtadej...@gmail.com> wrote: > > So, > > I had a problem when at a customer site. They use zabbix for data > collection and alerting. > > The solr server had been setup to use only jmx metrics. > > the jvm was unstable and would lock up for a period of time and the metrics > and counters would be all screwed up. Because it was using jmx to alert it > was screwing up as the jvm needed to be working to be used. > > So I turned on gclogging and wrote a script to collect data points about > for instance how long the jvm was stopped in the last minute. > > I eventually got the gc tuned and behaving well but it was difficult. > > > turn on gcloging > > i use these options > > -Xloggc:../var/logs/gclog.log \ > -XX:+PrintHeapAtGC \ > -XX:+PrintGCDetails \ > -XX:+PrintGCDateStamps \ > -XX:+PrintGCTimeStamps \ > -XX:+PrintTenuringDistribution \ > -XX:+PrintGCApplicationStoppedTime \ > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 \ > -XX:GCLogFileSize=100M > > in the solr runtme users crontab on each system .... > > * * * * * nohup /opt/scripts/getstats-gclog & > > this is the script > > /opt/scripts/getstats-gclog > #/bin/bash -x > # > # get some statistics > # > # GC time stamp > # 2018-06-27T12:52:57.200+0000 > # > FIVEMIN=`date --date '5 minutes ago' +'%Y-%m-%dT%H:%M'` > FOURMIN=`date --date '4 minutes ago' +'%Y-%m-%dT%H:%M'` > THREEMIN=`date --date '3 minutes ago' +'%Y-%m-%dT%H:%M'` > TWOMIN=`date --date '2 minutes ago' +'%Y-%m-%dT%H:%M'` > ONEMIN=`date --date '1 minute ago' +'%Y-%m-%dT%H:%M'` > YEAR=`date --date '1 minute ago' +'%Y'` > MONTH=`date --date '1 minute ago' +'%m'` > DAY=`date --date '1 minute ago' +'%d'` > HOUR=`date --date '1 minute ago' +'%H'` > MINUTE=`date --date '1 minute ago' +'%M'` > SECOND=`date --date '1 minute ago' +'%S'` > # > # > STATSDIR=/opt/stats/ > WORKDIR=/$STATSDIR/working_gc > # > # > LOGDIR=/u01/app/solr/var/logs/ > LOGNAME=gclog.log > Prep() { > mkdir -p $STATSDIR > chmod 755 $STATSDIR > mkdir -p $WORKDIR > chmod 755 $WORKDIR > } > GetStats() { > cd $WORKDIR > grep $ONEMIN $LOGDIR/$LOGNAME.*|grep stopped|awk '{print $11}'>ALLGC > COUNT=`cat ALLGC |wc -l` > # number under .00X XDecimalplaces U3D for example > U3D=`grep "^0\.00[1-9]" ALLGC|wc -l` > if [ -z $U3D ] > then > U3D=0 > fi > U2D=`grep "^0\.0[1-9]" ALLGC|wc -l` > if [ -z $U2D ] > then > U2D=0 > fi > U1D=`grep "^0\.[1-9]" ALLGC|wc -l` > if [ -z $U1D ] > then > U1D=0 > fi > O1S=`cat ALLGC | grep -v "^0\."|wc -l` > if [ -z $O1S ] > then > O1S=0 > fi > O10S=`grep "[0-9]\+[0-9]\.[0-9]*" ALLGC|wc -l` > if [ -z $O10S ] > then > O10S=0 > fi > cat ALLGC | grep -v "^0\.">OVER1SECDATA > > TOTAL=0 > COUNT=0 > while read DAT > do > TOTAL=`echo "$TOTAL + $DAT"|bc` > COUNT=`expr $COUNT + 1` > done <$WORKDIR/ALLGC > #AO1S=$(printf "%.2f\n" `echo "scale=10;$COUNT/60" | bc`) > #AVGQT=$(printf "%.0f\n" `echo "scale=10;$TOTAL/$COUNT"|bc`) > TOTSTOP=$TOTAL > AVGSTOPT=`echo "scale=7;$TOTAL/$COUNT"|bc` > if [ -z $AVGSTOPT ] > then > AVGSTOPT=0 > fi > # get top gc times > # > > #echo 0.0000000>ALLGCU1S > #echo 0.0000000>ALLGCO1S > > grep '^0.' $WORKDIR/ALLGC >ALLGCU1S > grep -v '^0.' $WORKDIR/ALLGC >ALLGCO1S > TOPGCTIMEU1S=`cat $WORKDIR/ALLGCU1S |sort |tail -1` > if [ -z $TOPGCTIMEU1S ] > then > TOPGCTIMEU1S=0 > fi > TOPGCTIMEO1S=`cat $WORKDIR/ALLGCO1S |sort |tail -1` > if [ -z $TOPGCTIMEO1S ] > then > TOPGCTIMEO1S=0 > fi > > } > PrintStats() { > # > ## stats > #COUNT= total number of garbage collection this minute > ## U3d = Total number of GC that are under 0.00Xseconds > # U2D total number of GC that are under 0.0X seconds > # U1D total number of GC that are under 0.X seconds > # O1S total number of GC that are over 1 second but under 10 > # O10S total number of GC that are over 10 seconds > # TOTSTOPT the total time stopped for all GCs > # AVGSTOPT the average time of all the GCs > # TOPGCTIMEU1S the highest GC time Under 1 sec this minute > # TOPGCTIMEO1S the highest GC time Over 1 sec this minute > echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S > $TOPGCTIMEO1S >$STATSDIR/GCSTATS > #echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S > $TOPGCTIMEO1S > } > > Prep > GetStats > PrintStats > > then in the zabbix-agentd.conf > add these parameters > > # total number of GC in the last minute > UserParameter=gc-num,cat /opt/stats/GCSTATS |awk '{print $1}' > # total number of GC 0.00[1-9] - 0.0000000 second in the last minute > UserParameter=gc-n3d,cat /opt/stats/GCSTATS |awk '{print $2}' > # total number of GC 0.0[1-9] - 0.009 second in the last minute > UserParameter=gc-n2d,cat /opt/stats/GCSTATS |awk '{print $3}' > # total number of GC 0.[1-9] - 0.09 second in the last minute > UserParameter=gc-n1d,cat /opt/stats/GCSTATS |awk '{print $4}' > # total number of GC [1-9].X seconds in the last minute > UserParameter=gc-no1s,cat /opt/stats/GCSTATS |awk '{print $5}' > # total number of GC OVER 10 seconds in the last minute > UserParameter=gc-no10s,cat /opt/stats/GCSTATS |awk '{print $6}' > # these are all 0.0000000 time > # Total time the JVM was stopped for GC in the last minute > UserParameter=gc-tst,cat /opt/stats/GCSTATS |awk '{print $7}' > # Average time the JVM was stopped for ALL GC in the last minute > UserParameter=gc-ast,cat /opt/stats/GCSTATS |awk '{print $8}' > # Highest Time the JVM was stopped for GC under 1 second > UserParameter=gc-ttu1s,cat /opt/stats/GCSTATS |awk '{print $9}' > # Highest Time the JVM was stopped for GC OVER 1 second > UserParameter=gc-tto1s,cat /opt/stats/GCSTATS |awk '{print $10}' > > you have to confgure zabbix items, triggers and graphs for each of the data > points > > > > On Mon, Mar 18, 2019 at 12:34 PM Erick Erickson <erickerick...@gmail.com> > wrote: > >> Attachments are pretty aggressively stripped by the apache mail server, so >> it didn’t come through. >> >> That said, I’m not sure how much use just the last GC time is. What do you >> want it for? This >> sounds a bit like an XY problem. >> >> Best, >> Erick >> >>> On Mar 17, 2019, at 2:43 PM, Karthik K G <kgkarthi...@gmail.com> wrote: >>> >>> Hi Team, >>> >>> I was looking for Old GC duration time metrics, but all I could find was >> the API for this "/solr/admin/metrics?wt=json&group=jvm&prefix=gc.G1- >> Old-Generation", but I am not sure if this is for >> 'gc_g1_gen_o_lastgc_duration'. I tried to hookup the IP to the jconsole and >> was looking for the metrics, but all I could see was the collection time >> but not last GC duration as attached in the screenshot. Can you please help >> here with finding the correct metrics. I strongly believe we are not >> capturing this information. Please correct me if I am wrong. >>> >>> Thanks & Regards, >>> Karthik >> >>