So,

I ran into this problem at a customer site. They use Zabbix for data
collection and alerting.

The Solr server had been set up to use only JMX metrics.

The JVM was unstable and would lock up for periods of time, and the metrics
and counters would come out wrong. Because the alerting relied on JMX, it
failed exactly when it was needed: JMX only answers while the JVM is working.

So I turned on GC logging and wrote a script to collect data points from the
log, for instance how long the JVM was stopped in the last minute.

I eventually got the GC tuned and behaving well, but it was difficult.


Turn on GC logging

I use these options:

-Xloggc:../var/logs/gclog.log \
-XX:+PrintHeapAtGC \
-XX:+PrintGCDetails \
-XX:+PrintGCDateStamps \
-XX:+PrintGCTimeStamps \
-XX:+PrintTenuringDistribution \
-XX:+PrintGCApplicationStoppedTime \
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 \
-XX:GCLogFileSize=100M
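
With -XX:+PrintGCApplicationStoppedTime and the two timestamp options, each
pause is written on one line of roughly the shape shown below. The sample line
is made up for illustration (only the timestamp is borrowed from the comment in
the script further down), but it shows why a minute prefix such as
2018-06-27T12:52 selects the right lines and why the eleventh
whitespace-separated field is the pause length in seconds:

```shell
# Illustrative log line, not taken from a real log.
line='2018-06-27T12:52:57.200+0000: 1.234: Total time for which application threads were stopped: 0.0001234 seconds, Stopping threads took: 0.0000567 seconds'

# Minute prefix in the same format the script builds with date.
minute='2018-06-27T12:52'

# Keep lines from that minute that report a stop, then print field 11.
pause=$(echo "$line" | grep "$minute" | grep stopped | awk '{print $11}')
echo "$pause"
```

When grep runs over several files it prefixes each line with the file name,
but that prefix is glued to the first field, so the field number stays the
same.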

In the Solr runtime user's crontab on each system:

* * * * * nohup /opt/scripts/getstats-gclog &

This is the script:

/opt/scripts/getstats-gclog
#!/bin/bash
#
# get some statistics
#
# GC time stamp
# 2018-06-27T12:52:57.200+0000
#
FIVEMIN=`date --date '5 minutes ago' +'%Y-%m-%dT%H:%M'`
FOURMIN=`date --date '4 minutes ago' +'%Y-%m-%dT%H:%M'`
THREEMIN=`date --date '3 minutes ago' +'%Y-%m-%dT%H:%M'`
TWOMIN=`date --date '2 minutes ago' +'%Y-%m-%dT%H:%M'`
ONEMIN=`date --date '1 minute ago' +'%Y-%m-%dT%H:%M'`
YEAR=`date --date '1 minute ago' +'%Y'`
MONTH=`date --date '1 minute ago' +'%m'`
DAY=`date --date '1 minute ago' +'%d'`
HOUR=`date --date '1 minute ago' +'%H'`
MINUTE=`date --date '1 minute ago' +'%M'`
SECOND=`date --date '1 minute ago' +'%S'`
#
#
STATSDIR=/opt/stats
WORKDIR=$STATSDIR/working_gc
#
#
LOGDIR=/u01/app/solr/var/logs/
LOGNAME=gclog.log
Prep() {
mkdir -p $STATSDIR
chmod 755 $STATSDIR
mkdir -p $WORKDIR
chmod 755 $WORKDIR
}
GetStats() {
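cd $WORKDIR || exit 1
# Pull last minute's "stopped" lines from the rotated logs (gclog.log.*);
# field 11 is the pause length in seconds.
grep $ONEMIN $LOGDIR/$LOGNAME.*|grep stopped|awk '{print $11}'>ALLGC
COUNT=`cat ALLGC |wc -l`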
# Bucket the pauses by the position of the first non-zero decimal.
# grep -c always prints a number, so no empty-value checks are needed.
# U3D: 0.00X seconds, U2D: 0.0X seconds, U1D: 0.X seconds
U3D=`grep -c "^0\.00[1-9]" ALLGC`
U2D=`grep -c "^0\.0[1-9]" ALLGC`
U1D=`grep -c "^0\.[1-9]" ALLGC`
# O1S: 1 second or more (this includes the pauses also counted in O10S)
O1S=`grep -vc "^0\." ALLGC`
# O10S: 10 seconds or more (two or more digits before the decimal point)
O10S=`grep -c "^[0-9][0-9]\+\." ALLGC`
grep -v "^0\." ALLGC >OVER1SECDATA

TOTAL=0
COUNT=0
while read DAT
do
TOTAL=`echo "$TOTAL + $DAT"|bc`
COUNT=`expr $COUNT + 1`
done <$WORKDIR/ALLGC
#AO1S=$(printf "%.2f\n"  `echo "scale=10;$COUNT/60" | bc`)
#AVGQT=$(printf "%.0f\n"  `echo "scale=10;$TOTAL/$COUNT"|bc`)
TOTSTOP=$TOTAL
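# Average pause. Skip the division when there were no pauses this minute,
# otherwise bc reports a divide-by-zero error.
if [ "$COUNT" -gt 0 ]
then
  AVGSTOPT=`echo "scale=7;$TOTAL/$COUNT"|bc`
else
  AVGSTOPT=0
fi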
# get top gc times
#

#echo 0.0000000>ALLGCU1S
#echo 0.0000000>ALLGCO1S

grep '^0\.' $WORKDIR/ALLGC >ALLGCU1S
grep -v '^0\.' $WORKDIR/ALLGC >ALLGCO1S
# sort numerically so that, for example, 10.5 ranks above 9.2
TOPGCTIMEU1S=`sort -n $WORKDIR/ALLGCU1S |tail -1`
if [ -z "$TOPGCTIMEU1S" ]
then
  TOPGCTIMEU1S=0
fi
TOPGCTIMEO1S=`sort -n $WORKDIR/ALLGCO1S |tail -1`
if [ -z "$TOPGCTIMEO1S" ]
then
  TOPGCTIMEO1S=0
fi

}
PrintStats() {
#
## stats
# COUNT        total number of GC pauses this minute
# U3D          number of GC pauses of 0.00X seconds
# U2D          number of GC pauses of 0.0X seconds
# U1D          number of GC pauses of 0.X seconds
# O1S          number of GC pauses of 1 second or more
# O10S         number of GC pauses of 10 seconds or more
# TOTSTOP      the total time stopped for all GCs
# AVGSTOPT     the average time of all the GCs
# TOPGCTIMEU1S the highest GC time under 1 second this minute
# TOPGCTIMEO1S the highest GC time over 1 second this minute
echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S \
  $TOPGCTIMEO1S >$STATSDIR/GCSTATS
#echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S $TOPGCTIMEO1S
}

Prep
GetStats
PrintStats
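
The bucket counts, total, average and maxima can also be computed in a single
awk pass over the pause file, which avoids the bc loop and the extra temporary
files. This is only a sketch of an alternative, not what ran at the customer
site. The function name gc_summary is made up, and it assumes a file with one
pause duration in seconds per line, like ALLGC above. It prints the same ten
fields in the same order.

```shell
#!/bin/bash
# gc_summary FILE: print count, the five bucket counts, total, average,
# longest pause under 1 s and longest pause of 1 s or more.
gc_summary() {
  awk '
    BEGIN { n = u3 = u2 = u1 = o1 = o10 = 0; total = maxu = maxo = 0 }
    {
      n++; total += $1
      if      ($1 ~ /^0\.00[1-9]/) u3++    # 0.00X seconds
      else if ($1 ~ /^0\.0[1-9]/)  u2++    # 0.0X seconds
      else if ($1 ~ /^0\.[1-9]/)   u1++    # 0.X seconds
      if ($1 + 0 >= 1) { o1++; if ($1 + 0 > maxo) maxo = $1 + 0 }
      else if ($1 + 0 > maxu) maxu = $1 + 0
      if ($1 + 0 >= 10) o10++
    }
    END {
      avg = (n > 0) ? total / n : 0        # no division when the file is empty
      printf "%d %d %d %d %d %d %.7f %.7f %.7f %.7f\n",
             n, u3, u2, u1, o1, o10, total, avg, maxu, maxo
    }' "$1"
}

# Usage with made-up sample values:
tmp=$(mktemp)
printf '%s\n' 0.0012000 0.0450000 0.3000000 1.5000000 12.2500000 >"$tmp"
gc_summary "$tmp"
rm -f "$tmp"
```

One visible difference: printf always writes a leading zero, whereas bc prints
values below 1 without one (for example .0001234).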

Then in zabbix_agentd.conf, add these parameters:

# total number of GC pauses in the last minute
UserParameter=gc-num,cat /opt/stats/GCSTATS |awk '{print $1}'
# number of GC pauses from 0.001 to under 0.01 seconds in the last minute
UserParameter=gc-n3d,cat /opt/stats/GCSTATS |awk '{print $2}'
# number of GC pauses from 0.01 to under 0.1 seconds in the last minute
UserParameter=gc-n2d,cat /opt/stats/GCSTATS |awk '{print $3}'
# number of GC pauses from 0.1 to under 1 second in the last minute
UserParameter=gc-n1d,cat /opt/stats/GCSTATS |awk '{print $4}'
# number of GC pauses of 1 second or more in the last minute
UserParameter=gc-no1s,cat /opt/stats/GCSTATS |awk '{print $5}'
# number of GC pauses of 10 seconds or more in the last minute
UserParameter=gc-no10s,cat /opt/stats/GCSTATS |awk '{print $6}'
# the values below are times in seconds with seven decimal places
# total time the JVM was stopped for GC in the last minute
UserParameter=gc-tst,cat /opt/stats/GCSTATS |awk '{print $7}'
# average time the JVM was stopped for ALL GC in the last minute
UserParameter=gc-ast,cat /opt/stats/GCSTATS |awk '{print $8}'
# highest time the JVM was stopped for GC under 1 second
UserParameter=gc-ttu1s,cat /opt/stats/GCSTATS |awk '{print $9}'
# highest time the JVM was stopped for GC over 1 second
UserParameter=gc-tto1s,cat /opt/stats/GCSTATS |awk '{print $10}'
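
Each UserParameter command is an ordinary shell pipeline, so it can be tried by
hand before the agent is restarted. A small sketch, using a temporary file with
made-up values as a stand-in for /opt/stats/GCSTATS:

```shell
# Stand-in for /opt/stats/GCSTATS with made-up values.
stats=$(mktemp)
echo "5 1 1 1 2 1 14.0962000 2.8192400 0.3000000 12.2500000" >"$stats"

# Same pipelines as gc-num, gc-tst and gc-tto1s, pointed at the stand-in file.
gc_num=$(cat "$stats" | awk '{print $1}')
gc_tst=$(cat "$stats" | awk '{print $7}')
gc_tto1s=$(cat "$stats" | awk '{print $10}')
echo "gc-num=$gc_num gc-tst=$gc_tst gc-tto1s=$gc_tto1s"
rm -f "$stats"
```

On a host with the agent installed, running zabbix_agentd -t gc-num should
report the same value through the agent itself.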

You have to configure Zabbix items, triggers and graphs for each of the data
points.



On Mon, Mar 18, 2019 at 12:34 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> Attachments are pretty aggressively stripped by the apache mail server, so
> it didn’t come through.
>
> That said, I’m not sure how much use just the last GC time is. What do you
> want it for? This
> sounds a bit like an XY problem.
>
> Best,
> Erick
>
> > On Mar 17, 2019, at 2:43 PM, Karthik K G <kgkarthi...@gmail.com> wrote:
> >
> > Hi Team,
> >
> > I was looking for Old GC duration time metrics, but all I could find was
> > the API for this
> > "/solr/admin/metrics?wt=json&group=jvm&prefix=gc.G1-Old-Generation",
> > but I am not sure if this is for 'gc_g1_gen_o_lastgc_duration'. I tried
> > to hookup the IP to the jconsole and was looking for the metrics, but
> > all I could see was the collection time but not last GC duration as
> > attached in the screenshot. Can you please help here with finding the
> > correct metrics. I strongly believe we are not capturing this
> > information. Please correct me if I am wrong.
> >
> > Thanks & Regards,
> > Karthik
>
>
