On 01/10/2014 12:36 PM, reza azimi wrote:
> hello guys,
>
> I'm looking for a state of art job scheduler and health monitoring for
> my beowulf cluster and due to my research I've found many of them which
> made me confused. Can you help or recommend me the ones which are very
> hot and they are using in industry?
> I have lm-sensors package on my servers and wanna a health monitoring
> program which record the temp as well, all I found are mainly record
> resource utilization.
> Our workload are mainly MPI based benchmarks and we want to test some
> hadoop benchmarks in future.

Our solution with Grid Engine is to have a cron job monitor the contents
of the IPMI SEL. If the SEL contains any messages that are not on a
whitelist, a file in /var gets generated (conversely, if no messages are
in the SEL, the file gets removed). We have a GE load sensor that checks
for the presence of this file and places the node in an alarm state when
it sees the file, preventing new jobs from being scheduled on that node.

We then have Nagios monitoring the output of "qstat -xml" on the
scheduler nodes, so we get notified when a node goes into an alarm state.

Skylar
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
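
For illustration only, the cron-side SEL check described in the reply could be sketched roughly as below. This is not the actual script from the post: the flag file path, the whitelist patterns, and the function names are all hypothetical, and the sketch removes the flag file whenever no non-whitelisted entries remain, which is one reading of the description. It assumes `ipmitool sel list` is available and prints one SEL entry per line.

```python
"""Sketch of a cron-driven IPMI SEL check that maintains a flag file."""
import re
import subprocess
from pathlib import Path

# Hypothetical values chosen for this example; the post only says "a file in /var".
FLAG_FILE = Path("/var/run/sel_alarm")
WHITELIST = [
    re.compile(r"Log area reset/cleared"),  # entry left behind by clearing the SEL
    re.compile(r"SEL has no entries"),      # assumption: treat the empty-SEL notice as harmless
]


def unexpected_entries(sel_lines, whitelist):
    """Return the non-empty SEL lines that match no whitelist pattern."""
    entries = [line.strip() for line in sel_lines if line.strip()]
    return [e for e in entries if not any(p.search(e) for p in whitelist)]


def update_flag(sel_lines, whitelist, flag_file):
    """Write flag_file if unexpected entries exist, otherwise remove it."""
    bad = unexpected_entries(sel_lines, whitelist)
    if bad:
        # Store the offending entries so an admin can see why the node is flagged.
        flag_file.write_text("\n".join(bad) + "\n")
    else:
        flag_file.unlink(missing_ok=True)  # requires Python 3.8+
    return bad


def read_sel():
    """Read the SEL via ipmitool (typically needs root and a local BMC)."""
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

A script run from cron would call `update_flag(read_sel(), WHITELIST, FLAG_FILE)`. The Grid Engine load sensor then only has to test whether the flag file exists, which keeps the slow IPMI query out of the load sensor itself.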