Re: [Beowulf] Monitoring and reporting Infiniband errors

2014-06-19 Thread John Hearns
pps. I guess I could clear the errors every time this runs, but have decided to just do an initial clear of the errors and look at the cumulative rate. ppps. there is a better list for this chatter, isn't there... On 19 June 2014 15:10, John Hearns wrote: > If anyone is interested, here is my

Re: [Beowulf] Monitoring and reporting Infiniband errors

2014-06-19 Thread John Hearns
If anyone is interested, here is my solution, which seems good enough. Someone will no doubt say there is a neater way!

A shell script which runs ibqueryerrors and returns 1 if anything is found:

#!/bin/bash
# check for errors on the Infiniband fabric 0
# another script runs for port 1
errors=`/
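The script in the message above is cut off in this archive view, so the following is not the author's code. It is only a minimal sketch of the same idea: inspect an ibqueryerrors report and return 1 if anything is found. The function name `check_report` is hypothetical, and the sketch assumes that nodes with errors are introduced by lines beginning with "Errors for"; that pattern should be checked against the output of the installed infiniband-diags version.

```shell
#!/bin/bash
# Sketch only: read an ibqueryerrors-style report on stdin and return 1
# if it flags any node, 0 otherwise.
# Assumption: nodes with errors are introduced by lines starting with
# "Errors for"; verify this against your installed infiniband-diags.
check_report() {
    local report count
    report=$(cat)
    # grep -c prints the number of matching lines; it exits 1 when there
    # are none, hence the "|| true" to keep the count of 0.
    count=$(printf '%s\n' "$report" | grep -c '^Errors for' || true)
    if [ "$count" -gt 0 ]; then
        # Pass the details on to whoever receives the alert.
        printf '%s\n' "$report" >&2
        return 1
    fi
    return 0
}
```

A wrapper run by Monit could then pipe the tool's output through the function, for example `ibqueryerrors 2>&1 | check_report`, and exit with the resulting status. Reading the report from stdin keeps the parsing separate from the tool invocation, so the check can be tried against a saved report without touching the fabric.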

[Beowulf] Monitoring and reporting Infiniband errors

2014-06-19 Thread John Hearns
Does anyone have good tips on monitoring a cluster for Infiniband errors? Specifically Mellanox/OpenFabrics on an SGI cluster. I am thinking of running ibcheckerrors or ibqueryerrors and parsing the output. I have Monit set up on the cluster head node http://mmonit.com/monit/ which I find quite