I'll throw in my $0.02 since I might be an oddball with how I build
things...
On 03/07/2016 08:43 PM, Jeff Friedman wrote:
Hello all. I am just entering the HPC Sales Engineering role, and
would like to focus my learning on the most relevant stuff. I have
searched near and far for a current survey of some sort listing the
top used “stacks”, but cannot seem to find one that is free. I was
breaking things down similar to this:
_OS distro_: CentOS, Debian, TOSS, etc? I know some come trimmed
down, and also include specific HPC libraries, like CNL, CNK, INK?
CentOS 7. In fact, the base OS for each of my nodes is created with just:
yum groups install "Compute Node" --releasever=7 \
    --installroot=/node_roots/sn2
... which is currently in ZFS and exported via NFSv4.
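To make that concrete, the whole workflow for building a node root looks
roughly like this (the "tank" pool name and the 10.1.0.0/16 cluster network
below are only illustrative placeholders):

# create a ZFS dataset to hold the node's root filesystem
zfs create -p -o mountpoint=/node_roots/sn2 tank/node_roots/sn2

# populate it with the stock CentOS 7 "Compute Node" package group
yum groups install "Compute Node" --releasever=7 \
    --installroot=/node_roots/sn2

# export it read-write over NFS to the cluster network
echo '/node_roots/sn2 10.1.0.0/16(rw,no_root_squash,async)' >> /etc/exports
exportfs -ra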
_MPI options_: MPICH2, MVAPICH2, Open MPI, Intel MPI, ?
All of the above (pretty much whatever our users want us to install).
_Provisioning software_: Cobbler, Warewulf, xCAT, Openstack, Platform
HPC, ?
We started with xCAT but moved away for various reasons. Provisioning is
done without this type of management software in my cluster. I have a
simple Python script to configure a new node's DHCP, PXE boot file, and
NFS export (each node has its own writable root filesystem served to it
via NFS). It's designed to be as simple an answer to "how can I PXE
boot CentOS?" as I could get.
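For illustration, the three things such a script writes for a new node look
roughly like this (node name, MAC, addresses, and paths are all made-up
placeholders, and the exact root= syntax depends on your initramfs):

NODE=node042
MAC=01-00-25-90-aa-bb-cc   # pxelinux config name: "01-" + MAC, dash-separated
IP=10.1.0.42

# 1. per-node PXE config pointing the kernel at the node's own NFS root
cat > /var/lib/tftpboot/pxelinux.cfg/${MAC} <<EOF
DEFAULT centos7
LABEL centos7
  KERNEL vmlinuz
  APPEND initrd=initrd.img root=nfs:10.1.0.1:/node_roots/${NODE}:rw ip=dhcp
EOF

# 2. a DHCP reservation so the node always PXE boots with the same address
cat >> /etc/dhcp/dhcpd.conf <<EOF
host ${NODE} {
  hardware ethernet 00:25:90:aa:bb:cc;
  fixed-address ${IP};
  next-server 10.1.0.1;
  filename "pxelinux.0";
}
EOF
systemctl restart dhcpd

# 3. an export of the node's private, writable root filesystem
echo "/node_roots/${NODE} ${IP}(rw,no_root_squash,async)" >> /etc/exports
exportfs -ra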
_Configuration management_: Warewulf, Puppet, Chef, Ansible, ?
SaltStack! This is what does the heavy lifting. Nodes boot with a very
generic CentOS image which only has 1 significant change from stock: a
Salt minion is installed. After a node boots, Salt takes over and
installs software, mounts remote filesystems, cooks dinner, starts
daemons, brings each node into the scheduler, etc. I don't maintain
"node images"; I maintain Salt states that do all the work after a node
boots.
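The moving parts are small; a hypothetical sketch of the flow (node name
and targeting pattern are just examples, and salt-minion comes from EPEL or
the SaltStack repo):

# the one deviation from stock in the node root: a Salt minion
yum install -y --installroot=/node_roots/sn2 salt-minion

# once the node has PXE booted and its minion has checked in, accept its
# key on the Salt master and apply the states that do the real work
salt-key --accept=node042.cluster
salt 'node042*' state.apply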
_Resource and job schedulers_: I think these are basically the same
thing? Torque, Lava, Maui, Moab, SLURM, Grid Engine, Son of Grid
Engine, Univa, Platform LSF, etc… others?
We briefly used Torque+Moab before running away crying. We now use SLURM.
_Shared filesystems_: NFS, pNFS, Lustre, GPFS, PVFS2, GlusterFS, ?
NFS (others may come in the future; we're looking at Ceph at the moment).
_Library management_: Lmod, ?
Lmod.
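From the user side that's just the usual module commands to pick between
the toolchains and MPI builds mentioned above (module names here are only
examples, not a list of what we provide):

module avail              # list the modulefiles visible to the user
module load gcc openmpi   # put a compiler and an MPI on the user's paths
module list               # show what is currently loaded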
_Performance monitoring_: Ganglia, Nagios, ?
Ganglia and in the near future, Zabbix.
_Cluster management toolkits_: I believe these perform many of the
functions above, all wrapped up in one tool? Rocks, Oscar, Scyld,
Bright, ?
Does anyone have any observations as to which of the above are the
most common? Or is that too broad? I believe most of the clusters I
will be involved with will be in the 128 - 2000 core range, all on
commodity hardware.
Thank you!
- Jeff
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf