Re: [Beowulf] Introduction and question

Bill Broadley Thu, 28 Feb 2019 00:43:29 -0800

Yes you belong!  Welcome to the list.

There are many different ways to run a cluster, but here are my recommendations:


* Make the clusters as identical as possible.

* Set up ansible roles for the head node, NAS, and compute nodes.

* Avoid installing/fixing things by hand with vi/apt-get/dpkg/yum/dnf; use
  ansible whenever possible.  Eventually you'll have to reinstall, and it's
  painful to manually reapply months of changes.

* Use environment modules; never have users manually run "export
  LD_LIBRARY_PATH=..."

* Use slurm partitions to keep significantly different hardware in different
  pools so users have an easy time of knowing what to run where.

* Set ALL compute nodes to netboot, then configure cobbler to tell them to
  boot from local disk normally.  That way you don't have to manually power on,
  wait for the BIOS, and select netboot 30 times to install 30 nodes.

* Enable/configure IPMI, at least for power on/off (if available).  Write
  wrapper scripts called pon and poff or similar.

* Keep working until cobbler+ansible can reinstall a compute node unattended:
  it should power off, enable netboot, power on, PXE install, reboot, run
  ansible, enable automount, and start slurmd.  Write wrapper scripts for
  netboot enable and netboot disable; I used bon and boff.
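
To make the wrapper idea concrete, here is a rough sketch of pon/poff and
bon/boff as small Python functions around ipmitool and cobbler.  The "-ipmi"
BMC hostname suffix and the "admin" user are placeholders I made up, and you
should check the cobbler --netboot-enabled flag against your installed version;
treat this as a starting point, not the scripts I actually run.

```python
import subprocess

# Hypothetical convention: each node's BMC answers at "<node>-ipmi".
BMC_SUFFIX = "-ipmi"
IPMI_USER = "admin"  # placeholder; use your site's BMC account


def ipmi_power_cmd(node, action):
    """Build the ipmitool command for a power action on one node."""
    if action not in ("on", "off", "status"):
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", node + BMC_SUFFIX,
        "-U", IPMI_USER,
        "-E",  # read the password from the IPMI_PASSWORD environment variable
        "chassis", "power", action,
    ]


def netboot_cmd(node, enabled):
    """Build the cobbler command that toggles netboot for one system."""
    flag = "true" if enabled else "false"
    return ["cobbler", "system", "edit",
            f"--name={node}", f"--netboot-enabled={flag}"]


def pon(node):
    """Power a node on through its BMC."""
    subprocess.run(ipmi_power_cmd(node, "on"), check=True)


def poff(node):
    """Power a node off through its BMC."""
    subprocess.run(ipmi_power_cmd(node, "off"), check=True)


def bon(node):
    """Enable netboot so the next boot PXE-installs the node."""
    subprocess.run(netboot_cmd(node, True), check=True)


def boff(node):
    """Disable netboot so the node boots from local disk."""
    subprocess.run(netboot_cmd(node, False), check=True)
```

Keeping the command construction separate from the subprocess call means you
can print or test the exact command line without touching any hardware.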

The above isn't the only way to do it, but it's a reasonable starting point.
It's really nice for users to just be able to browse apps and say "module load
<app>".  As a SysAdmin it's nice to be able to reinstall any wonky nodes and not
have to play the "what X things do I need to do before it can run jobs" game.
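
As a sketch of what "reinstall a wonky node" can look like once the wrappers
exist, the snippet below just calls poff, bon, and pon in order.  It assumes
those wrappers are executables on the PATH that take a node name as their only
argument, which is my assumption for illustration; the function names
reinstall_steps and reinstall are made up too.

```python
import subprocess


def reinstall_steps(node):
    """Return the admin-side commands that kick off a reinstall, in order."""
    return [
        ["poff", node],  # power the node off
        ["bon", node],   # enable netboot so the next boot PXE-installs
        ["pon", node],   # power back on; the node takes it from here
    ]


def reinstall(node, run=subprocess.run):
    """Run each step, stopping at the first failure.

    `run` is injectable so the sequence can be dry-run or tested
    without touching real hardware.
    """
    for cmd in reinstall_steps(node):
        run(cmd, check=True)
```

For a dry run, pass something like run=lambda cmd, check: print(cmd).  The rest
of the sequence (PXE install, reboot, ansible, automount, slurmd) is assumed to
happen without further help from this script, and when to call boff depends on
how your cobbler is set up, so it is left out here.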

Good luck, have fun, and keep us posted.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
