Yes, you belong! Welcome to the list. There are many different ways to run a cluster, but here are my recommendations:
* Make the clusters as identical as possible.
* Set up ansible roles for the head node, NAS, and compute nodes.
* Avoid installing/fixing things with vi/apt-get/dpkg/yum/dnf; use ansible whenever possible. Eventually you'll have to reinstall, and it's painful to manually reapply months of changes.
* Use environment modules; never have users manually running "export LD_LIBRARY_PATH=...".
* Use slurm partitions to keep significantly different hardware in different pools, so users can easily tell what to run where.
* Set ALL compute nodes to netboot, then configure cobbler to tell them to boot from local disk normally. That way you don't have to manually power on, wait for the BIOS, and select netboot 30 times to install 30 nodes.
* Enable/configure IPMI, at least for power on/off (if available). Write wrapper scripts called pon and poff or similar.
* Keep working until cobbler+ansible can reinstall a compute node unattended: power off, enable netboot, power on, PXE install, reboot, run ansible, enable automount, and start slurmd. Write wrapper scripts for netboot enable and netboot disable; I used bon and boff.

The above isn't the only way to do it, but it's a reasonable starting point. It's really nice for users to just be able to browse apps and say "module load <app>". As a sysadmin it's nice to be able to reinstall any wonky node and not have to play the "what X things do I need to do before it can run jobs" game.

Good luck, have fun, and keep us posted.
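
P.S. For anyone wondering what the pon/poff and bon/boff wrappers might look like, here is a minimal sketch in Python. It is not the actual scripts, just one way to do it under some assumptions: ipmitool talks to each node's BMC, the BMC hostname is the node name plus "-ipmi", the BMC account is "admin", and the node is registered in cobbler under its own name. All of those names are placeholders, and the accepted values for cobbler's --netboot-enabled option may differ between versions, so check against your install. The sketch only builds and prints the commands (a dry run); it does not execute anything.

```python
import shlex

# Assumptions for illustration only; adjust to your site.
BMC_SUFFIX = "-ipmi"   # hypothetical: the BMC for "node01" answers at "node01-ipmi"
IPMI_USER = "admin"    # hypothetical BMC account name


def power_cmd(node, action):
    """Build the ipmitool command for a chassis power action on one node."""
    if action not in ("on", "off", "status"):
        raise ValueError(f"unsupported power action: {action}")
    # -E tells ipmitool to read the password from the IPMI_PASSWORD
    # environment variable, so it never appears on the command line.
    return ["ipmitool", "-I", "lanplus", "-H", node + BMC_SUFFIX,
            "-U", IPMI_USER, "-E", "chassis", "power", action]


def netboot_cmd(node, enabled):
    """Build the cobbler command that turns netboot on or off for one node."""
    flag = "true" if enabled else "false"
    return ["cobbler", "system", "edit",
            f"--name={node}", f"--netboot-enabled={flag}"]


# Wrapper name -> function that builds the command for a single node.
WRAPPERS = {
    "pon": lambda node: power_cmd(node, "on"),
    "poff": lambda node: power_cmd(node, "off"),
    "bon": lambda node: netboot_cmd(node, True),
    "boff": lambda node: netboot_cmd(node, False),
}


def reinstall_plan(node):
    """First steps of a reinstall: power off, enable netboot, power on."""
    return [WRAPPERS[name](node) for name in ("poff", "bon", "pon")]


# Dry run: print the plan for a hypothetical node instead of executing it.
for cmd in reinstall_plan("node01"):
    print(shlex.join(cmd))
```

To actually run a step, hand the list to subprocess.run(cmd, check=True). Building the commands as lists rather than shell strings avoids quoting problems with odd hostnames.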