GCViewer will work on the GC logging file created by Solr. It has some nice 
summaries, particularly of stop-the-world GC events.

From there, you can pinpoint the exact times from manual inspection of the GC 
log. It usually looks like this:

Your replica went into recovery when again? sometime between 14:00 and 15:00.

GCViewer: Oh, look. Sometime between 14:30 and 14:35 there was a 20 second 
stop-the-world GC pause. 

Ok, back to the Solr log and I see “leader initiated recovery” sent by the 
leader to this replica at 14:32. Oh look, the GC log shows the stop-the-world 
GC pause starting at 14:31:45. Let's tune GC.

That’s if you want to pin it exactly, but usually it looks more like this:

Oh, look. GCViewer shows long stop-the-world GC pauses about the time your 
replica went into recovery, let’s start tuning GC.

There are other products that give you a good way to navigate the GC events, 
GCViewer is free though.

Best,
Erick

> On Mar 18, 2019, at 10:17 AM, Jeff Courtade <courtadej...@gmail.com> wrote:
> 
> So,
> 
> I had a problem when at a customer site. They use zabbix for data
> collection and alerting.
> 
> The solr server had been setup to use only jmx metrics.
> 
> the jvm was unstable and would lock up for a period of time and the metrics
> and counters would be all screwed up. Because it was using jmx to alert it
> was screwing up as  the jvm needed to be working to be used.
> 
> So I turned on gclogging and wrote a script to collect data points about
> for instance how long the jvm was stopped in the last minute.
> 
> I eventually got the gc tuned and behaving well but it was difficult.
> 
> 
> turn on gcloging
> 
> i use these options
> 
> -Xloggc:../var/logs/gclog.log \
> -XX:+PrintHeapAtGC \
> -XX:+PrintGCDetails \
> -XX:+PrintGCDateStamps \
> -XX:+PrintGCTimeStamps \
> -XX:+PrintTenuringDistribution \
> -XX:+PrintGCApplicationStoppedTime \
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 \
> -XX:GCLogFileSize=100M
> 
> in the solr runtme users crontab on each system ....
> 
> * * * * * nohup /opt/scripts/getstats-gclog &
> 
> this is the script
> 
> /opt/scripts/getstats-gclog
> #/bin/bash -x
> #
> # get some statistics
> #
> # GC time stamp
> # 2018-06-27T12:52:57.200+0000
> #
> FIVEMIN=`date --date '5 minutes ago' +'%Y-%m-%dT%H:%M'`
> FOURMIN=`date --date '4 minutes ago' +'%Y-%m-%dT%H:%M'`
> THREEMIN=`date --date '3 minutes ago' +'%Y-%m-%dT%H:%M'`
> TWOMIN=`date --date '2 minutes ago' +'%Y-%m-%dT%H:%M'`
> ONEMIN=`date --date '1 minute ago' +'%Y-%m-%dT%H:%M'`
> YEAR=`date --date '1 minute ago' +'%Y'`
> MONTH=`date --date '1 minute ago' +'%m'`
> DAY=`date --date '1 minute ago' +'%d'`
> HOUR=`date --date '1 minute ago' +'%H'`
> MINUTE=`date --date '1 minute ago' +'%M'`
> SECOND=`date --date '1 minute ago' +'%S'`
> #
> #
> STATSDIR=/opt/stats/
> WORKDIR=/$STATSDIR/working_gc
> #
> #
> LOGDIR=/u01/app/solr/var/logs/
> LOGNAME=gclog.log
> Prep() {
> mkdir -p $STATSDIR
> chmod 755 $STATSDIR
> mkdir -p $WORKDIR
> chmod 755 $WORKDIR
> }
> GetStats() {
> cd $WORKDIR
> grep $ONEMIN $LOGDIR/$LOGNAME.*|grep stopped|awk '{print $11}'>ALLGC
> COUNT=`cat ALLGC |wc -l`
> # number under .00X XDecimalplaces U3D for example
> U3D=`grep "^0\.00[1-9]" ALLGC|wc -l`
> if [ -z $U3D ]
> then
>  U3D=0
> fi
> U2D=`grep "^0\.0[1-9]" ALLGC|wc -l`
> if [ -z $U2D ]
> then
>  U2D=0
> fi
> U1D=`grep "^0\.[1-9]" ALLGC|wc -l`
> if [ -z $U1D ]
> then
>  U1D=0
> fi
> O1S=`cat ALLGC | grep -v "^0\."|wc -l`
> if [ -z $O1S ]
> then
>  O1S=0
> fi
> O10S=`grep "[0-9]\+[0-9]\.[0-9]*" ALLGC|wc -l`
> if [ -z $O10S ]
> then
>  O10S=0
> fi
> cat ALLGC | grep -v "^0\.">OVER1SECDATA
> 
> TOTAL=0
> COUNT=0
> while read DAT
> do
> TOTAL=`echo "$TOTAL + $DAT"|bc`
> COUNT=`expr $COUNT + 1`
> done <$WORKDIR/ALLGC
> #AO1S=$(printf "%.2f\n"  `echo "scale=10;$COUNT/60" | bc`)
> #AVGQT=$(printf "%.0f\n"  `echo "scale=10;$TOTAL/$COUNT"|bc`)
> TOTSTOP=$TOTAL
> AVGSTOPT=`echo "scale=7;$TOTAL/$COUNT"|bc`
> if [ -z $AVGSTOPT ]
> then
>  AVGSTOPT=0
> fi
> # get top gc times
> #
> 
> #echo 0.0000000>ALLGCU1S
> #echo 0.0000000>ALLGCO1S
> 
> grep '^0.' $WORKDIR/ALLGC >ALLGCU1S
> grep -v '^0.' $WORKDIR/ALLGC >ALLGCO1S
> TOPGCTIMEU1S=`cat $WORKDIR/ALLGCU1S |sort |tail -1`
> if [ -z $TOPGCTIMEU1S ]
> then
>  TOPGCTIMEU1S=0
> fi
> TOPGCTIMEO1S=`cat $WORKDIR/ALLGCO1S |sort |tail -1`
> if [ -z $TOPGCTIMEO1S ]
> then
>  TOPGCTIMEO1S=0
> fi
> 
> }
> PrintStats() {
> #
> ## stats
> #COUNT= total number of garbage collection this minute
> ## U3d = Total number of GC that are under 0.00Xseconds
> # U2D total number of GC that are under 0.0X seconds
> # U1D total number of GC that are under 0.X seconds
> # O1S total number of GC that are over 1 second but under 10
> # O10S total number of GC that are over 10 seconds
> # TOTSTOPT the total time stopped for all GCs
> # AVGSTOPT the average time of all the GCs
> # TOPGCTIMEU1S the highest GC time Under 1 sec this minute
> # TOPGCTIMEO1S the highest GC time Over 1 sec this minute
> echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S
> $TOPGCTIMEO1S >$STATSDIR/GCSTATS
> #echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S
> $TOPGCTIMEO1S
> }
> 
> Prep
> GetStats
> PrintStats
> 
> then in the zabbix-agentd.conf
> add these parameters
> 
> # total number of GC in the last minute
> UserParameter=gc-num,cat /opt/stats/GCSTATS |awk '{print $1}'
> # total number of GC 0.00[1-9] - 0.0000000 second in the last minute
> UserParameter=gc-n3d,cat /opt/stats/GCSTATS |awk '{print $2}'
> # total number of GC 0.0[1-9] - 0.009 second in the last minute
> UserParameter=gc-n2d,cat /opt/stats/GCSTATS |awk '{print $3}'
> # total number of GC  0.[1-9] - 0.09 second in the last minute
> UserParameter=gc-n1d,cat /opt/stats/GCSTATS |awk '{print $4}'
> # total number of GC [1-9].X seconds in the last minute
> UserParameter=gc-no1s,cat /opt/stats/GCSTATS |awk '{print $5}'
> # total number of GC OVER 10  seconds in the last minute
> UserParameter=gc-no10s,cat /opt/stats/GCSTATS |awk '{print $6}'
> # these are all 0.0000000 time
> # Total time the JVM was stopped for GC in the last minute
> UserParameter=gc-tst,cat /opt/stats/GCSTATS |awk '{print $7}'
> # Average time the JVM was stopped for ALL GC in the last minute
> UserParameter=gc-ast,cat /opt/stats/GCSTATS |awk '{print $8}'
> # Highest Time the JVM was stopped for GC under 1 second
> UserParameter=gc-ttu1s,cat /opt/stats/GCSTATS |awk '{print $9}'
> # Highest Time the JVM was stopped for GC OVER 1 second
> UserParameter=gc-tto1s,cat /opt/stats/GCSTATS |awk '{print $10}'
> 
> you have to confgure zabbix items, triggers and graphs for each of the data
> points
> 
> 
> 
> On Mon, Mar 18, 2019 at 12:34 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> Attachments are pretty aggressively stripped by the apache mail server, so
>> it didn’t come through.
>> 
>> That said, I’m not sure how much use just the last GC time is. What do you
>> want it for? This
>> sounds a bit like an XY problem.
>> 
>> Best,
>> Erick
>> 
>>> On Mar 17, 2019, at 2:43 PM, Karthik K G <kgkarthi...@gmail.com> wrote:
>>> 
>>> Hi Team,
>>> 
>>> I was looking for Old GC duration time metrics, but all I could find was
>> the API for this "/solr/admin/metrics?wt=json&group=jvm&prefix=gc.G1-
>> Old-Generation", but I am not sure if this is for
>> 'gc_g1_gen_o_lastgc_duration'. I tried to hookup the IP to the jconsole and
>> was looking for the metrics, but all I could see was the collection time
>> but not last GC duration as attached in the screenshot. Can you please help
>> here with finding the correct metrics. I strongly believe we are not
>> capturing this information. Please correct me if I am wrong.
>>> 
>>> Thanks & Regards,
>>> Karthik
>> 
>> 

Reply via email to