No cluster mgr/framework in use... Custom-compiled and packaged the Slurm 16.05.4 release into .rpm/.deb files, and used them to install the different nodes.
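Roughly, the packaging can be sketched like this (illustrative commands only, not necessarily what we ran; 'rpmbuild -ta' is the build path the Slurm install docs describe, and fpm is just one way to turn a staged make install into a .deb):

  # RPM side (build path described in the Slurm install docs):
  rpmbuild -ta slurm-16.05.4.tar.bz2

  # .deb side: one option is a staged install wrapped with fpm
  # (illustrative flags and paths):
  ./configure --prefix=/usr --sysconfdir=/etc/slurm && make -j4
  make install DESTDIR=$PWD/pkgroot
  fpm -s dir -t deb -n slurm -v 16.05.4 -C pkgroot .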
Although the homedirs are no longer shared, the nodes do have access to shared storage, one mount of which is a subdir of the home directory; via a conf file and a system we designed, items from that shared area can be symlinked "auto-magically" up to the homedir level. So shared dotfiles, subdirs/files in the homedir, etc. are all possible.

Have not investigated a containerized Slurm setup - will have to put that on the exploration list. If the workloads were Dockerized, I'd probably run them via Kubernetes rather than Slurm...

-Will

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of John Hearns
Sent: Friday, May 25, 2018 5:44 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Controller / backup controller q's

Will, I know I will regret chiming in here. Are you able to say what cluster manager or framework you are using?

I don't see a problem in running two different distributions, but as Pär says, look at your development environment. For my part, I would ask: have you thought about containerisation? I.e. CentOS compute nodes running Singularity?

Also, the 'unique home directory per node' gives me the heebie-jeebies. I guess technically it is OK. However, many commercial packages create dot files or dot directories in user home directories; I am thinking of things like Ansys and Matlab etc. here. What will you do if these dotfiles are not consistent across the cluster? Before anyone says it, I was arguing somewhere else recently that 'home directories' are an outdated concept when you are running HPC. I still think that, and this is a classic case in point. Forgive me if I have misunderstood your setup.

On 25 May 2018 at 11:30, Pär Lindfors <pa...@nsc.liu.se> wrote:

Hi Will,

On 05/24/2018 05:43 PM, Will Dennis wrote:
> (we were using CentOS 7.x originally, now the compute nodes are on
> Ubuntu 16.04.) Currently, we have a single controller (slurmctld) node,
> an accounting db node (slurmdbd), and 10 compute/worker nodes (slurmd.)

Time to start upgrading to Ubuntu 18.04 now then? :-)

For a 10 node cluster it might make more sense to run slurmctld and slurmdbd on the same hardware, as neither has very high hardware requirements. On our current clusters we run both services on the same machine. The main disadvantage is that it makes upgrades inconvenient, as it prevents upgrading slurmdbd and slurmctld independently. For future installations we will probably try running slurmdbd in a VM.
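Configuration-wise, co-locating the two is just a matter of pointing both daemons at the same host. A minimal sketch (hostnames are placeholders, parameter names as in the 16.05 series):

  # slurm.conf
  ControlMachine=slurm-master               # hypothetical hostname
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStorageHost=slurm-master        # slurmdbd on the same box

  # slurmdbd.conf
  DbdHost=slurm-master
  StorageType=accounting_storage/mysql
  StorageHost=localhost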
> The problem is that the controller is still running CentOS 7 with our
> older NFS-mounted /home scheme, but the compute nodes are now all Ubuntu
> 16.04 with local /home fs's.

Does each user have a different local home directory on each compute node? That is not something I would recommend, unless you are very good at training your users to avoid submitting jobs in their home directories. I assume you have some other shared file system across the cluster?

> 1) Can we leave the current controller machine on C7 OS, and just
> have the users log into other machines (that have the same config as the
> compute nodes) to submit jobs? Or should the controller node really be
> on the same OS as the compute nodes for some reason?

I recommend separating them, for systems administration and user convenience reasons. With users logged into the same machine that is running your controller or other cluster services, they can impact the operation of the entire cluster when they make mistakes. Typical user mistakes involve using all CPU resources, using all memory, and filling up or overloading filesystems... Much better to have that happen on dedicated login machines.

If the login machine uses a different OS than the worker nodes, users will also run into problems when they compile software there, as system library versions won't match what is available on the compute nodes.

Technically, as long as you use the same Slurm version it should work. You should however check that your Slurm binaries on the different OSes are built with the exact same features enabled. Many features are selected at compile time, so check and compare the output of ./configure on each build host.

> 2) Can I add a backup controller node that runs a different...
> 3) What are the steps to replace a primary controller, given that a ...

We are not currently using a backup controller, so I can't answer that part. slurmctld keeps its state files in the directory configured as StateSaveLocation, so for slurmctld you typically only need to save the configuration files and that directory (see the short sketch below). Note that this does not include munge or the slurmdbd.

Regards,
Pär Lindfors, NSC
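A minimal sketch of that backup, assuming a typical install (the paths below are placeholders; check StateSaveLocation in your own slurm.conf):

  # Where does this slurmctld keep its state?
  scontrol show config | grep -i StateSaveLocation    # e.g. /var/spool/slurmctld

  # Save the config directory plus the state directory.
  # This does not cover the munge key or the slurmdbd database.
  tar czf slurmctld-backup.tar.gz /etc/slurm /var/spool/slurmctld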