Hi folks,

We run a bunch of health checks [1] on a compute node through Torque [2] and if they fail the node gets knocked offline.

One of the checks we do is that there are no symbol errors on the IB link. However, I'm wondering whether knocking a node offline for a single symbol error is too brutal - what do other people do about these?

cheers!
Chris

[1] - for the record we check things like: amount of RAM, failed DIMMs (via IPMI on IBM or memlog on SGI), number of cores, number and speed of CPUs, LDAP OK, home directories accessible, etc.

[2] - checks run prior to a job start, after a job exits and every 7.5 minutes (every 10 mom intervals).

-- 
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computational Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
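
For illustration only, here is a minimal sketch of one alternative to the "any symbol error fails the node" rule described in the message: compare two successive readings of the symbol error counter and fail only if the rate over the check interval exceeds a threshold. None of this comes from the original post. The function names and the threshold value are hypothetical, the way the counter is read is left out entirely, and the 450 second default simply reflects the 7.5 minute check interval mentioned in footnote [2].

```python
# Sketch of a rate-based symbol error check (all names are hypothetical).

CHECK_INTERVAL_S = 450  # 7.5 minutes, the interval given in footnote [2]

# Assumed threshold, not a recommendation; tune it for your own fabric.
MAX_SYMBOL_ERRORS_PER_HOUR = 10.0


def symbol_error_rate(prev_count, curr_count, elapsed_s):
    """Return symbol errors per hour between two counter readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    # If the counter went backwards it was probably reset in between,
    # so treat the current reading as the number of new errors.
    if curr_count >= prev_count:
        delta = curr_count - prev_count
    else:
        delta = curr_count
    return delta * 3600.0 / elapsed_s


def link_ok(prev_count, curr_count, elapsed_s=CHECK_INTERVAL_S,
            max_per_hour=MAX_SYMBOL_ERRORS_PER_HOUR):
    """True if the symbol error rate is at or below the threshold."""
    return symbol_error_rate(prev_count, curr_count, elapsed_s) <= max_per_hour
```

With these assumed values, one new error in a 7.5 minute window works out to 8 errors per hour and passes, while two new errors in the same window (16 per hour) would fail the node. Whether a rate threshold is the right policy at all is exactly the open question in the message.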