UGE is used on clusters with thousands of nodes. Health checks are done via load sensors, an SGE/UGE feature; however, I am not aware of any public repo of shared health checks. As for overheating, in one cluster it was handled at the BIOS/firmware level by asking the vendor to set certain thresholds at which the machine shuts itself off, and the event is logged in syslog.
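
For anyone who has not written a load sensor before, here is a rough sketch of one in Python that reports a temperature. It assumes the usual Grid Engine load sensor protocol (the execution daemon writes a line to the sensor's stdin for each reporting interval, the sensor answers with a `begin` / `host:name:value` / `end` block, and `quit` ends it) and it assumes the temperatures are exposed under `/sys/class/hwmon` on Linux. The complex name `max_temp_c` and the function names are made up for illustration; this is not a script from any of the clusters mentioned above.

```python
#!/usr/bin/env python3
"""Sketch of a Grid Engine load sensor reporting the hottest hwmon temperature."""
import glob
import os
import socket
import sys


def read_max_temp_c(hwmon_root="/sys/class/hwmon"):
    """Return the highest temperature in degrees Celsius, or None if none is readable."""
    readings = []
    for path in glob.glob(os.path.join(hwmon_root, "hwmon*", "temp*_input")):
        try:
            with open(path) as f:
                # hwmon temp*_input files hold millidegrees Celsius as an integer.
                readings.append(int(f.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # skip sensors that cannot be read or parsed
    return max(readings) if readings else None


def format_report(host, name, value):
    """Build one load report block: begin, host:name:value, end."""
    return "begin\n{}:{}:{:.1f}\nend\n".format(host, name, value)


def serve(stdin=sys.stdin, stdout=sys.stdout, name="max_temp_c",
          hwmon_root="/sys/class/hwmon"):
    """Answer each input line with one report until 'quit' or end of input."""
    # Assumption: the host name must match the name Grid Engine knows the
    # node by; gethostname() may need adjusting (short name vs. FQDN).
    host = socket.gethostname()
    for line in stdin:
        if line.strip() == "quit":
            break
        temp = read_max_temp_c(hwmon_root)
        if temp is None:
            # Assumption: an empty begin/end block is accepted as "no values".
            stdout.write("begin\nend\n")
        else:
            stdout.write(format_report(host, name, temp))
        stdout.flush()  # the daemon reads from a pipe, so flush every report
```

To try it as a real sensor you would call `serve()` at the bottom of the script, define a matching complex attribute, and point the `load_sensor` parameter of the host configuration at the script; check the `sge_conf` man page of your SGE/UGE version for the exact details.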
On 11 January 2014 00:22, Adam DeConinck <ajde...@ajdecon.org> wrote:
> Hi Reza,
>
> The "common stack" seems to vary depending on what industry you're looking
> at. For example, Grid Engine seems to be a really popular job scheduler in
> bioinformatics, even though I get the impression that it's on the way out
> in a lot of other industries.
>
> I think most cluster management tools are fairly mature right now. Some
> are more actively developed than others, but I don't think "what's hot" is
> necessarily a good way to choose your tools.
>
> More important is whether someone on your team is familiar with those
> tools, or with the languages they're written in; or whether you can get
> support easily if you don't have expertise yourself.
>
> For what it's worth, my current "favorites" for scheduling and monitoring
> include:
>
> * Job scheduler: SLURM
> * Light-weight health checks between jobs: Warewulf NHC
> * Detailed performance monitoring: Ganglia
>
> Neither NHC or Ganglia do temperature monitoring out-of-the-box (last I
> checked), but they're both really easy to extend with something as easy as
> bash scripts.
>
> Adam
>
> On Fri, Jan 10, 2014 at 12:36 PM, reza azimi <reza.c.az...@gmail.com> wrote:
>
>> hello guys,
>>
>> I'm looking for a state of art job scheduler and health monitoring for my
>> beowulf cluster and due to my research I've found many of them which made
>> me confused. Can you help or recommend me the ones which are very hot and
>> they are using in industry?
>> I have lm-sensors package on my servers and wanna a health monitoring
>> program which record the temp as well, all I found are mainly record
>> resource utilization.
>> Our workload are mainly MPI based benchmarks and we want to test some
>> hadoop benchmarks in future.
>>
>> Regards
>> Reza
_______________________________________________ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf