FYI, after more internet sleuthing (searching for “juju slurm”), I came across this outstanding-looking project, Omnivector Slurm Distribution (OSD): https://omnivector-solutions.github.io/osd-documentation/master/index.html
This project uses Juju (a Canonical project) to deploy, configure, and manage a Slurm cluster along with a variety of other components, like SlurmREST API, Prometheus integration, log forwarding via Fluentbit to Graylog, and others. Deployment targets include cloud (AWS/OpenStack), local LXD, MAAS for bare metal…

I’ve only started to play with OSD, but it looks like a great framework for deploying Slurm clusters.

Quick install on an Ubuntu 22.04 LTS host:

    sudo snap install juju --classic
    sudo snap install lxd
    lxd init --auto
    lxc network set lxdbr0 ipv6.address none
    sudo ufw allow 8443/tcp
    juju bootstrap --show-log localhost

Followed by a quick test of sinfo:

    juju run --unit slurmctld/0 "sinfo"
    PARTITION  AVAIL TIMELIMIT NODES STATE NODELIST
    osd-slurmd up    infinite  1     down* juju-65df3d-2

    juju run --unit slurmctld/0 "sinfo -R"
    REASON   USER  TIMESTAMP           NODELIST
    New node slurm 2023-03-15T01:21:21 juju-65df3d-2

Mike

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Hanby, Mike <mha...@uab.edu>
Date: Wednesday, February 15, 2023 at 1:51 PM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Running Containerized Slurmctld and Slurmdb in Production?

Howdy,

Just wondering if any sites are running containerized Slurmctld and Slurmdbd in production?

We are planning to migrate from a single host running slurmctld, slurmdbd, and MySQL (and other HPC services) to separate OpenStack VMs. Our site averages fewer than 1,000 running / pending jobs at any given time.
Like many HPC sites, our jobs are a mix of long-running, large arrays, very short…

I ran across the GitHub project “Slurm Docker Cluster” (https://github.com/giovtorres/slurm-docker-cluster), and it got me thinking that this method might be great for simpler upgrades, ease of reproducing the cluster in development, etc…

How about it, anyone running containerized Slurm server processes in production?

Thanks,

Mike