Hi all

We have some issues with our Solr servers spending too much time
paused doing GC. By turning on GC debug logging and extracting numbers
from the GC log, we're getting an idea of just how much of a problem
it is.

I'm currently doing this in a hacky, inefficient way:

grep -h 'Total time for which application threads were stopped:' solr_gc* \
    | awk '($11 > 0.3) { print $1, $11 }' \
    | sed 's#:.*:##' \
    | sort -n \
    | sum_by_date.py

(Yes, I really am using sed, grep and awk all in one line. Just wrong :)

The "sum_by_date.py" program simply adds up all the values with the
same first column, and remembers the largest value seen. This is
giving me the cumulative GC time for extended pauses (over 0.3s, per
the awk filter above), and the maximum pause seen in a given time
period (hourly), e.g.:

2015-11-13T11 119.124037 2.203569
2015-11-13T12 184.683309 3.156565
2015-11-13T13 65.934526 1.978202
2015-11-13T14 63.970378 1.411700
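
For reference, sum_by_date.py does roughly the following. This is a
minimal sketch rather than the exact script: it assumes each input line
on stdin has two whitespace-separated columns (the hour key and the
pause in seconds), and the function name summarise is only illustrative.

```python
#!/usr/bin/env python3
"""Sketch of sum_by_date.py: per-key sum and maximum of the second column."""
import sys


def summarise(lines):
    """Return {key: (total, maximum)} for 'key value' input lines."""
    stats = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: silently skip blank or malformed lines
        key, value = fields[0], float(fields[1])
        total, maximum = stats.get(key, (0.0, 0.0))
        stats[key] = (total + value, max(maximum, value))
    return stats


def main():
    stats = summarise(sys.stdin)
    # Keys such as 2015-11-13T11 sort chronologically as plain strings.
    for key in sorted(stats):
        total, maximum = stats[key]
        print("%s %.6f %.6f" % (key, total, maximum))


if __name__ == "__main__":
    main()
```

The output columns are the hour, the summed pause time and the longest
single pause, in the same layout as the figures above.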


This is fine for seeing that we have a problem. However, I really need
to get this into our monitoring systems - we use munin. I'm
struggling to work out the best way to extract this information for
our monitoring systems, and I think this might be down to my naivety
about Java and about what should be logged.

I've turned on JMX debugging and have been looking at the different
beans available using jconsole, but I'm drowning in information. What
would be the best thing to monitor?

Ideally, like the stats above, I'd like to know the cumulative time
spent paused in GC since the last poll, and the longest GC pause that
we see. munin polls every 5 minutes; are there suitable counters
exposed by JMX that it could extract?

Thanks in advance

Tom
