Hi all,

We have some issues with our Solr servers spending too much time paused doing GC. By turning on GC debug logging and extracting numbers from the GC log, we're getting an idea of just how much of a problem this is.

I'm currently doing this in a hacky, inefficient way:

    grep -h 'Total time for which application threads were stopped:' solr_gc* \
      | awk '($11 > 0.3) { print $1, $11 }' \
      | sed 's#:.*:##' \
      | sort -n \
      | sum_by_date.py

(Yes, I really am using sed, grep and awk all in one line. Just wrong :)

The "sum_by_date.py" program simply adds up all the values with the same first column, and remembers the largest value seen. This gives me the cumulative GC time for extended pauses (over 0.3s, the threshold in the awk filter) and the maximum pause seen in a given time period (hourly), e.g.:

    hour           total (s)    max pause (s)
    2015-11-13T11  119.124037   2.203569
    2015-11-13T12  184.683309   3.156565
    2015-11-13T13   65.934526   1.978202
    2015-11-13T14   63.970378   1.411700

This is fine for seeing that we have a problem. However, I really need to get this into our monitoring systems - we use munin. I'm struggling to work out the best way to extract this information for monitoring, and I think this might be down to my naivety about Java and about what should be logged.

I've turned on JMX debugging and have been looking at the different beans available using jconsole, but I'm drowning in information. What would be the best thing to monitor?

Ideally, like the stats above, I'd like to know the cumulative time spent paused in GC since the last poll, and the longest GC pause seen in that interval. munin polls every 5 minutes. Are there suitable counters exposed by JMX that it could extract?

Thanks in advance
Tom
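
For reference, here is a minimal sketch of the aggregation step described above: sum the values that share a first column and remember the largest one. It is a reconstruction from that description, not the actual sum_by_date.py; the names sum_by_key and main are illustrative, and it assumes each input line has the form "<hour-key> <seconds>".

```python
import sys


def sum_by_key(lines):
    """Return {key: (total, largest)} for lines of the form '<key> <value>'.

    Assumption: two whitespace-separated columns per line, the second
    being a pause time in seconds. Other lines are skipped.
    """
    stats = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank or malformed line
        key, value = parts[0], float(parts[1])
        total, largest = stats.get(key, (0.0, 0.0))
        stats[key] = (total + value, max(largest, value))
    return stats


def main(stream=sys.stdin, out=sys.stdout):
    """Print one 'key total max' line per key, in first-seen order."""
    for key, (total, largest) in sum_by_key(stream).items():
        out.write("%s %.6f %.6f\n" % (key, total, largest))
```

To use it as the last stage of the pipeline, call main() at the bottom of the script. Because the input has already been through sort -n, first-seen order is also chronological order.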