Really interesting to see what stacks people are using!

----- Original Message -----
> From: "Jeff Friedman" <jeff.fried...@siliconmechanics.com>
> To: beowulf@beowulf.org
> Sent: Tuesday, 8 March, 2016 06:43:59
> Subject: [Beowulf] Most common cluster management software, job schedulers, etc?
> Hello all. I am just entering the HPC Sales Engineering role, and would like to
> focus my learning on the most relevant stuff. I have searched near and far for a
> current survey of some sort listing the top used “stacks”, but cannot seem to
> find one that is free. I was breaking things down similar to this:
>
> OS distro: CentOS, Debian, TOSS, etc? I know some come trimmed down, and also
> include specific HPC libraries, like CNL, CNK, INK?

CentOS. It (and RHEL) has the best coverage for driver support (InfiniBand, Lustre/GPFS, GPU, Xeon Phi) and ISV code compatibility. If that were not an issue, I'd go with Debian.

> MPI options: MPICH2, MVAPICH2, Open MPI, Intel MPI, ?

Intel MPI, Open MPI and MVAPICH2. It's good to have at least two stacks installed: if one flakes out with a bug, it's straightforward to try the secondary one (there's a quick mpi4py sanity-check sketch at the bottom of this mail).

> Provisioning software: Cobbler, Warewulf, xCAT, Openstack, Platform HPC, ?
> Configuration management: Warewulf, Puppet, Chef, Ansible, ?

We use Warewulf, but we are moving towards keeping it as a simple provisioner and doing configuration management with Ansible; we're piloting that in a new project. Lots of playbooks are available from our GitHub: https://github.com/CSC-IT-Center-for-Science/fgci-ansible (YMMV). We also run a pretty big "general IT" server and cloud infrastructure, so using a non-HPC-specific configuration management tool will hopefully create some synergies.

> Resource and job schedulers: I think these are basically the same thing? Torque,
> Lava, Maui, Moab, SLURM, Grid Engine, Son of Grid Engine, Univa, Platform LSF,
> etc… others?

We moved everything to SLURM a few years back and haven't looked back :) Support from SchedMD has been good. (A small scripted sbatch submission sketch is at the bottom of this mail.)

> Shared filesystems: NFS, pNFS, Lustre, GPFS, PVFS2, GlusterFS, ?

BeeGFS seems to be gaining a lot of traction in the small-to-medium cluster space, and it was recently open sourced. We run self-supported Lustre, plus Ceph for cloud. It will be interesting to see how Ceph evolves in the high-performance space.

> Library management: Lmod, ?

Lmod.

> Performance monitoring: Ganglia, Nagios, ?

- Collectd/Graphite/Grafana for system/infrastructure metrics (a tiny Graphite example is at the bottom of this mail)
- Nagios, with OpsView as the Nagios GUI (we might move to Icinga or Sensu at some point)
- ELK for log analytics
- Open XDMoD for queue monitoring (we're looking at using SupreMM at the moment)
- Allinea Performance Reports for per-job analysis

> Cluster management toolkits: I believe these perform many of the functions above,
> all wrapped up in one tool? Rocks, Oscar, Scyld, Bright, ?
>
> Does anyone have any observations as to which of the above are the most common?
> Or is that too broad? I believe most of the clusters I will be involved with will
> be in the 128 - 2000 core range, all on commodity hardware.
>
> Thank you!
>
> - Jeff
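PS: a few quick sketches in case they are useful as illustrations. They are hedged examples rather than anything from our production setup, so treat partition, host and metric names as made up. First, a minimal MPI "hello world" using mpi4py, which is a handy way to sanity-check whichever MPI stack is currently loaded (this assumes mpi4py was built against that stack):

    # mpi_hello.py - minimal check that the currently loaded MPI stack works.
    # Assumes mpi4py was built against that stack; run with e.g.:
    #   mpirun -np 4 python mpi_hello.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # this process's rank within the job
    size = comm.Get_size()          # total number of ranks
    node = MPI.Get_processor_name() # hostname the rank is running on

    # Each rank reports in; useful for confirming ranks land where you expect.
    print(f"Hello from rank {rank} of {size} on {node}")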
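Second, a sketch of scripted job submission to SLURM via sbatch; the partition name and resource numbers below are placeholders, adjust them per site:

    # submit_hello.py - sketch of scripted SLURM submission via sbatch.
    # Partition name and resource sizes are placeholders.
    import subprocess

    cmd = [
        "sbatch",
        "--job-name=mpi_hello",
        "--partition=test",                  # hypothetical partition name
        "--ntasks=4",
        "--time=00:05:00",
        "--wrap=srun python mpi_hello.py",   # wrap a one-line job script
    ]

    # On success sbatch prints something like "Submitted batch job 123456".
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout.strip())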
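Finally, a tiny example of pushing one data point into Graphite over the carbon plaintext protocol (collectd normally does this for us; the host name and metric path here are made up):

    # push_metric.py - send a single data point to Graphite's plaintext
    # carbon listener. Host and metric path are made-up placeholders.
    import socket
    import time

    CARBON_HOST = "graphite.example.org"   # hypothetical Graphite host
    CARBON_PORT = 2003                     # default carbon plaintext port

    metric_path = "cluster.node001.load1"  # hypothetical metric name
    value = 0.42
    timestamp = int(time.time())

    # Plaintext protocol: one "<path> <value> <timestamp>\n" line per point.
    line = f"{metric_path} {value} {timestamp}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))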