So, I had a problem when at a customer site. They use zabbix for data collection and alerting.
The solr server had been setup to use only jmx metrics. the jvm was unstable and would lock up for a period of time and the metrics and counters would be all screwed up. Because it was using jmx to alert it was screwing up as the jvm needed to be working to be used. So I turned on gclogging and wrote a script to collect data points about for instance how long the jvm was stopped in the last minute. I eventually got the gc tuned and behaving well but it was difficult. turn on gcloging i use these options -Xloggc:../var/logs/gclog.log \ -XX:+PrintHeapAtGC \ -XX:+PrintGCDetails \ -XX:+PrintGCDateStamps \ -XX:+PrintGCTimeStamps \ -XX:+PrintTenuringDistribution \ -XX:+PrintGCApplicationStoppedTime \ -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 \ -XX:GCLogFileSize=100M in the solr runtme users crontab on each system .... * * * * * nohup /opt/scripts/getstats-gclog & this is the script /opt/scripts/getstats-gclog #/bin/bash -x # # get some statistics # # GC time stamp # 2018-06-27T12:52:57.200+0000 # FIVEMIN=`date --date '5 minutes ago' +'%Y-%m-%dT%H:%M'` FOURMIN=`date --date '4 minutes ago' +'%Y-%m-%dT%H:%M'` THREEMIN=`date --date '3 minutes ago' +'%Y-%m-%dT%H:%M'` TWOMIN=`date --date '2 minutes ago' +'%Y-%m-%dT%H:%M'` ONEMIN=`date --date '1 minute ago' +'%Y-%m-%dT%H:%M'` YEAR=`date --date '1 minute ago' +'%Y'` MONTH=`date --date '1 minute ago' +'%m'` DAY=`date --date '1 minute ago' +'%d'` HOUR=`date --date '1 minute ago' +'%H'` MINUTE=`date --date '1 minute ago' +'%M'` SECOND=`date --date '1 minute ago' +'%S'` # # STATSDIR=/opt/stats/ WORKDIR=/$STATSDIR/working_gc # # LOGDIR=/u01/app/solr/var/logs/ LOGNAME=gclog.log Prep() { mkdir -p $STATSDIR chmod 755 $STATSDIR mkdir -p $WORKDIR chmod 755 $WORKDIR } GetStats() { cd $WORKDIR grep $ONEMIN $LOGDIR/$LOGNAME.*|grep stopped|awk '{print $11}'>ALLGC COUNT=`cat ALLGC |wc -l` # number under .00X XDecimalplaces U3D for example U3D=`grep "^0\.00[1-9]" ALLGC|wc -l` if [ -z $U3D ] then U3D=0 fi U2D=`grep "^0\.0[1-9]" ALLGC|wc -l` if [ -z $U2D ] then U2D=0 fi U1D=`grep "^0\.[1-9]" ALLGC|wc -l` if [ -z $U1D ] then U1D=0 fi O1S=`cat ALLGC | grep -v "^0\."|wc -l` if [ -z $O1S ] then O1S=0 fi O10S=`grep "[0-9]\+[0-9]\.[0-9]*" ALLGC|wc -l` if [ -z $O10S ] then O10S=0 fi cat ALLGC | grep -v "^0\.">OVER1SECDATA TOTAL=0 COUNT=0 while read DAT do TOTAL=`echo "$TOTAL + $DAT"|bc` COUNT=`expr $COUNT + 1` done <$WORKDIR/ALLGC #AO1S=$(printf "%.2f\n" `echo "scale=10;$COUNT/60" | bc`) #AVGQT=$(printf "%.0f\n" `echo "scale=10;$TOTAL/$COUNT"|bc`) TOTSTOP=$TOTAL AVGSTOPT=`echo "scale=7;$TOTAL/$COUNT"|bc` if [ -z $AVGSTOPT ] then AVGSTOPT=0 fi # get top gc times # #echo 0.0000000>ALLGCU1S #echo 0.0000000>ALLGCO1S grep '^0.' $WORKDIR/ALLGC >ALLGCU1S grep -v '^0.' $WORKDIR/ALLGC >ALLGCO1S TOPGCTIMEU1S=`cat $WORKDIR/ALLGCU1S |sort |tail -1` if [ -z $TOPGCTIMEU1S ] then TOPGCTIMEU1S=0 fi TOPGCTIMEO1S=`cat $WORKDIR/ALLGCO1S |sort |tail -1` if [ -z $TOPGCTIMEO1S ] then TOPGCTIMEO1S=0 fi } PrintStats() { # ## stats #COUNT= total number of garbage collection this minute ## U3d = Total number of GC that are under 0.00Xseconds # U2D total number of GC that are under 0.0X seconds # U1D total number of GC that are under 0.X seconds # O1S total number of GC that are over 1 second but under 10 # O10S total number of GC that are over 10 seconds # TOTSTOPT the total time stopped for all GCs # AVGSTOPT the average time of all the GCs # TOPGCTIMEU1S the highest GC time Under 1 sec this minute # TOPGCTIMEO1S the highest GC time Over 1 sec this minute echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S $TOPGCTIMEO1S >$STATSDIR/GCSTATS #echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S $TOPGCTIMEO1S } Prep GetStats PrintStats then in the zabbix-agentd.conf add these parameters # total number of GC in the last minute UserParameter=gc-num,cat /opt/stats/GCSTATS |awk '{print $1}' # total number of GC 0.00[1-9] - 0.0000000 second in the last minute UserParameter=gc-n3d,cat /opt/stats/GCSTATS |awk '{print $2}' # total number of GC 0.0[1-9] - 0.009 second in the last minute UserParameter=gc-n2d,cat /opt/stats/GCSTATS |awk '{print $3}' # total number of GC 0.[1-9] - 0.09 second in the last minute UserParameter=gc-n1d,cat /opt/stats/GCSTATS |awk '{print $4}' # total number of GC [1-9].X seconds in the last minute UserParameter=gc-no1s,cat /opt/stats/GCSTATS |awk '{print $5}' # total number of GC OVER 10 seconds in the last minute UserParameter=gc-no10s,cat /opt/stats/GCSTATS |awk '{print $6}' # these are all 0.0000000 time # Total time the JVM was stopped for GC in the last minute UserParameter=gc-tst,cat /opt/stats/GCSTATS |awk '{print $7}' # Average time the JVM was stopped for ALL GC in the last minute UserParameter=gc-ast,cat /opt/stats/GCSTATS |awk '{print $8}' # Highest Time the JVM was stopped for GC under 1 second UserParameter=gc-ttu1s,cat /opt/stats/GCSTATS |awk '{print $9}' # Highest Time the JVM was stopped for GC OVER 1 second UserParameter=gc-tto1s,cat /opt/stats/GCSTATS |awk '{print $10}' you have to confgure zabbix items, triggers and graphs for each of the data points On Mon, Mar 18, 2019 at 12:34 PM Erick Erickson <erickerick...@gmail.com> wrote: > Attachments are pretty aggressively stripped by the apache mail server, so > it didn’t come through. > > That said, I’m not sure how much use just the last GC time is. What do you > want it for? This > sounds a bit like an XY problem. > > Best, > Erick > > > On Mar 17, 2019, at 2:43 PM, Karthik K G <kgkarthi...@gmail.com> wrote: > > > > Hi Team, > > > > I was looking for Old GC duration time metrics, but all I could find was > the API for this "/solr/admin/metrics?wt=json&group=jvm&prefix=gc.G1- > Old-Generation", but I am not sure if this is for > 'gc_g1_gen_o_lastgc_duration'. I tried to hookup the IP to the jconsole and > was looking for the metrics, but all I could see was the collection time > but not last GC duration as attached in the screenshot. Can you please help > here with finding the correct metrics. I strongly believe we are not > capturing this information. Please correct me if I am wrong. > > > > Thanks & Regards, > > Karthik > >