Hi Tom,

SPM for Solr should be helpful here. See http://sematext.com/spm

Otis

> On Nov 13, 2015, at 10:00, Tom Evans <tevans...@googlemail.com> wrote:
>
> Hi all
>
> We have some issues with our Solr servers spending too much time
> paused doing GC. From turning on gc debug, and extracting numbers from
> the GC log, we're getting an idea of just how much of a problem.
>
> I'm currently doing this in a hacky, inefficient way:
>
> grep -h 'Total time for which application threads were stopped:' solr_gc* \
> | awk '($11 > 0.3) { print $1, $11 }' \
> | sed 's#:.*:##' \
> | sort -n \
> | sum_by_date.py
>
> (Yes, I really am using sed, grep and awk all in one line. Just wrong :)
>
> The "sum_by_date.py" program simply adds up all the values with the
> same first column, and remembers the largest value seen. This is
> giving me the cumulative GC time for extended pauses (over 0.5s), and
> the maximum pause seen in a given time period (hourly), eg:
>
> 2015-11-13T11 119.124037 2.203569
> 2015-11-13T12 184.683309 3.156565
> 2015-11-13T13 65.934526 1.978202
> 2015-11-13T14 63.970378 1.411700
>
> This is fine for seeing that we have a problem. However, really I need
> to get this in to our monitoring systems - we use munin. I'm
> struggling to work out the best way to extract this information for
> our monitoring systems, and I think this might be my naivety about
> Java, and working out what should be logged.
>
> I've turned on JMX debugging, and looking at the different beans
> available using jconsole, but I'm drowning in information. What would
> be the best thing to monitor?
>
> Ideally, like the stats above, I'd like to know the cumulative time
> spent paused in GC since the last poll, and the longest GC pause that
> we see. munin polls every 5 minutes, are there suitable counters
> exposed by JMX that it could extract?
>
> Thanks in advance
>
> Tom
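
For anyone following along, the "sum_by_date.py" step described in the quoted message (add up the values that share a first column and remember the largest) could look roughly like the sketch below. This is not Tom's actual script: the names summarize and format_report, the six-decimal output format, and the sample input values are all assumptions made for illustration. It expects whitespace-separated "key value" lines such as the ones the grep/awk/sed pipeline emits.

```python
"""Hypothetical sketch of a sum_by_date-style aggregation (not the original script)."""


def summarize(lines):
    """Return {key: (total, largest)} for whitespace-separated 'key value' lines."""
    stats = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key = parts[0]
        try:
            value = float(parts[1])
        except ValueError:
            continue  # skip lines whose second column is not a number
        total, largest = stats.get(key, (0.0, 0.0))
        stats[key] = (total + value, max(largest, value))
    return stats


def format_report(stats):
    """Render one 'key total largest' line per key, sorted by key."""
    return [
        f"{key} {total:.6f} {largest:.6f}"
        for key, (total, largest) in sorted(stats.items())
    ]


# Made-up sample input, only to show the shape of the data.
sample = [
    "2015-11-13T11 0.5",
    "2015-11-13T11 1.25",
    "2015-11-13T12 2.0",
]
for row in format_report(summarize(sample)):
    print(row)
```

At the end of the shell pipeline, the same function could be fed sys.stdin instead of the sample list. Keying on a shorter or longer timestamp prefix would change the bucket size, for example to match a 5 minute polling interval.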