If anyone is interested, here is my solution, which seems good enough.
Someone will no doubt say there is a neater way!

A shell script which runs ibqueryerrors and returns 1 if anything is found:

#!/bin/bash
# check for errors on the Infiniband fabric 0
# another script runs for port 1

errors=$(/usr/sbin/ibqueryerrors -c -s XmtWait -P0 | tail -n +2)
if [ -n "$errors" ] ; then
   echo "Check for errors on Infiniband Fabric 0"
   echo
   # quoted so that each reported line stays on its own line
   echo "$errors"
   exit 1
else
   exit 0
fi
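
Since the script above is duplicated for port 1, the same check could be
written once as a function that takes the port number. This is only a sketch:
check_ib_port and the IBQUERYERRORS override variable are names made up for
illustration (the override just makes it possible to try the logic without
real hardware), and the ibqueryerrors flags are exactly those used above.

```shell
#!/bin/bash
# Sketch only: same check as above, with the port passed as an argument.
# IBQUERYERRORS is a hypothetical override, not something ibqueryerrors reads.
IBQUERYERRORS=${IBQUERYERRORS:-/usr/sbin/ibqueryerrors}

check_ib_port() {
    local port=$1
    local errors
    # Drop the first line of output, as in the original script
    errors=$("$IBQUERYERRORS" -c -s XmtWait -P"$port" | tail -n +2)
    if [ -n "$errors" ]; then
        echo "Check for errors on Infiniband Fabric $port"
        echo
        echo "$errors"
        return 1    # non-zero status: something was reported
    fi
    return 0        # zero status: nothing reported
}
```

A wrapper script would then call "check_ib_port 0" (or 1) and exit with the
function's return status, so Monit sees the same 0/1 result as before.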

For Monit monitoring, exit 0 means the service is OK, exit 1 means there is
a problem.

So in monit:

check program ib0-errors with path "/usr/local/bin/check-ib0.sh"
   every "30 * * * *"
   if status == 1 then alert
   alert my.em...@domain.com with reminder on 30 cycles
   set mail-format { subject: $DESCRIPTION }



(P.S. Monit is only returning the first line of the script output - to be revised)
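
One possible workaround for the first-line limitation, untested against Monit
itself, would be to fold the whole report into a single line before printing
it. The helper below is a hypothetical sketch (summarise_lines is not part of
the script above): it counts the non-empty lines and joins them with "; ".

```shell
#!/bin/bash
# Hypothetical helper: collapse a multi-line report into one line,
# prefixed with the number of non-empty lines it contained.
summarise_lines() {
    local input=$1
    local line
    local count=0
    local joined=""
    while IFS= read -r line; do
        # Skip blank lines so they are neither counted nor joined
        if [ -z "$line" ]; then
            continue
        fi
        count=$((count + 1))
        # Add the "; " separator only once something has been collected
        joined="${joined:+$joined; }$line"
    done <<< "$input"
    printf '%d line(s): %s\n' "$count" "$joined"
}
```

In the check script this would be used as: summarise_lines "$errors", in place
of the multi-line echo.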



On 19 June 2014 14:18, John Hearns <hear...@googlemail.com> wrote:

> Does anyone have good tips on monitoring a cluster for Infiniband errors?
>
> Specifically Mellanox/OpenFabrics on an SGI cluster.
>
> I am thinking of running ibcheckerrors or ibqueryerrors and parsing the
> output.
>
> I have Monit set up on the cluster head node
> http://mmonit.com/monit/
>
> which I find quite good
>
> Also if individual nodes could use gmetric to report port errors as a
> Ganglia metric I have the ganglia-alert script set up to send email if
> ganglia values exceed set thresholds.
>
> Any ideas welcomed please.
>
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
