If anyone is interested, here is my solution, which seems good enough. Someone will no doubt say there is a neater way!
A shell script which runs ibqueryerrors and returns 1 if anything is found:

#!/bin/bash
# check for errors on the Infiniband fabric 0
# another script runs for port 1
errors=`/usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2`
if [ -n "$errors" ] ; then
    echo "Check for errors on Infiniband Fabric 0"
    echo
    # quoted so that each reported line stays on its own line
    echo "$errors"
    exit 1
else
    exit 0
fi

For Monit monitoring, exit 0 means the service is OK, exit 1 means there is a problem.
So in monit:

check program ib0-errors with path "/usr/local/bin/check-ib0.sh"
    every "30 * * * *"
    if status == 1 then alert
    alert my.em...@domain.com with reminder on 30 cycles

set mail-format { subject: $DESCRIPTION }

(PS: Monit is only returning the first line of the script output; to be revised.)

On 19 June 2014 14:18, John Hearns <hear...@googlemail.com> wrote:
> Does anyone have good tips on monitoring a cluster for Infiniband errors?
>
> Specifically Mellanox/OpenFabrics on an SGI cluster.
>
> I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
> output.
>
> I have Monit set up on the cluster head node
> http://mmonit.com/monit/
>
> which I find quite good
>
> Also if individual nodes could use gmetric to report port errors as a
> Ganglia metric I have the ganglia-alert script set up to send email if
> ganglia values exceed set thresholds.
>
> Any ideas welcomed please.
>
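
A rough sketch of a tidier variant, not tested against a real fabric: the check is pulled into a function that reads the report on stdin, so one piece of code could serve both ports instead of two near-identical scripts. The function name check_report and its label argument are made up for illustration; the exit-code convention (0 is OK, 1 is a problem) is the one described above.

```shell
#!/bin/bash
# Sketch only: check_report is a hypothetical helper, not part of any tool.
# It reads an ibqueryerrors-style report on stdin and takes a fabric label
# as its first argument.
check_report() {
    local label="$1"
    local errors
    # Drop the first line of the report, as the original script does.
    errors=$(tail -n +2)
    if [ -n "$errors" ]; then
        echo "Check for errors on Infiniband Fabric $label"
        echo
        # Quoted, so each reported line keeps its own line.
        echo "$errors"
        return 1
    fi
    return 0
}
```

It would be called with the same command as in the script above, for example: /usr/sbin/ibqueryerrors -c -s XmtWait -P0 | check_report 0 ; exit $?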