On 05/09/17 15:24, Stu Midgley wrote:

> I am in the process of redeveloping our cluster deployment and config
> management environment and wondered what others are doing?

xCAT here for all HPC-related infrastructure. Stateful installs for
GPFS NSD servers and TSM servers; compute nodes are all statelite, so an
immutable RAMdisk image is built on the management node for the compute
cluster and then on boot they mount various items over NFS (including
the GPFS state directory).

Nothing like your scale, of course, but it works and we know if a node
has booted a particular image it will be identical to any other node
that's set to boot the same image.

Healthcheck scripts mark nodes offline if they don't have the current
production kernel and GPFS versions (and other checks too of course),
plus Slurm's "scontrol reboot" lets us do rolling reboots without
needing to spot when nodes have become idle.

I've got to say I really prefer this to systems like Puppet, Salt, etc,
where you need to go and tweak an image after installation.

For our VM infrastructure (web servers, etc) we do use Salt. We used to
use Puppet but we switched when the only person who understood it left.
Don't miss it at all...

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
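
For anyone wanting a concrete picture of the healthcheck idea described
above (mark a node offline when its kernel or GPFS version differs from
the current production versions), here is a minimal sketch in Python.
It is not the actual script from the post. The names Versions,
mismatches and drain_command are hypothetical, the version strings are
made up, and using State=DRAIN as the "offline" state is an assumption.
The GPFS version is passed in as a plain string because how it is
obtained on a real node is site-specific. The sketch only builds the
scontrol command line; it does not run it.

```python
import platform
from dataclasses import dataclass


@dataclass(frozen=True)
class Versions:
    """Kernel and GPFS versions, either expected or found on a node."""
    kernel: str
    gpfs: str


def mismatches(expected: Versions, actual: Versions) -> list[str]:
    """Return human-readable reasons why a node fails the check."""
    reasons = []
    if actual.kernel != expected.kernel:
        reasons.append(f"kernel {actual.kernel} != {expected.kernel}")
    if actual.gpfs != expected.gpfs:
        reasons.append(f"gpfs {actual.gpfs} != {expected.gpfs}")
    return reasons


def drain_command(node: str, reasons: list[str]) -> list[str]:
    """Build (but do not run) a Slurm command that drains the node."""
    reason = "healthcheck: " + "; ".join(reasons)
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


if __name__ == "__main__":
    # Hypothetical production versions; a real site would read these
    # from wherever the current image definition lives.
    expected = Versions(kernel="3.10.0-693.el7.x86_64", gpfs="4.2.3-4")
    # The kernel release comes from the running system; the GPFS version
    # is hard-coded here purely for illustration.
    actual = Versions(kernel=platform.release(), gpfs="4.2.3-4")
    problems = mismatches(expected, actual)
    if problems:
        print(" ".join(drain_command(platform.node(), problems)))
    else:
        print("node OK")
```

Keeping the comparison separate from the command construction means the
check itself can be tested on any machine, without Slurm or GPFS being
installed.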