Hi Will,

On 05/24/2018 05:43 PM, Will Dennis wrote:
> (we were using CentOS 7.x originally, now the compute nodes are on
> Ubuntu 16.04.) Currently, we have a single controller (slurmctld)
> node, an accounting db node (slurmdbd), and 10 compute/worker nodes
> (slurmd.)
Time to start upgrading to Ubuntu 18.04 now then? :-)

For a 10 node cluster it might make more sense to run slurmctld and
slurmdbd on the same hardware, as neither has very high hardware
requirements. On our current clusters we run both services on the same
machine (see the config sketch in the P.S. below). The main
disadvantage of this is that it makes upgrades inconvenient, as it
prevents upgrading slurmdbd and slurmctld independently. For future
installations we will probably try running slurmdbd in a VM.

> The problem is that the controller is still running CentOS 7 with our
> older NFS-mounted /home scheme, but the compute nodes are now all
> Ubuntu 16.04 with local /home fs’s.

Does each user have a different local home directory on each compute
node? That is not something I would recommend, unless you are very good
at training your users to avoid submitting jobs in their home
directories. I assume you have some other shared file system across the
cluster?

> 1) Can we leave the current controller machine on C7 OS, and just
> have the users log into other machines (that have the same config as
> the compute nodes) to submit jobs? Or should the controller node
> really be on the same OS as the compute nodes for some reason?

I recommend separating them, for systems administration and user
convenience reasons. With users logged into the same machine that runs
your controller or other cluster services, the users can impact the
operation of the entire cluster when they make mistakes. Typical user
mistakes involve using all CPU resources, using all memory, and filling
up or overloading filesystems. Much better to have this happen on
dedicated login machines.

If the login machine uses a different OS than the worker nodes, users
will also run into problems if they compile software there, as system
library versions won't match what is available on the compute nodes.

Technically, as long as you use the same Slurm version it should work.
You should however check that your Slurm binaries on the different OSes
are built with the exact same features enabled. Many features are
enabled at compile time, so check and compare the output from
./configure (one way to do this is sketched in the P.S. below).

> 2) Can I add a backup controller node that runs a different...
> 3) What are the steps to replace a primary controller, given that a ...

We are not currently using a backup controller, so I can't answer that
part. slurmctld keeps its state files in the directory configured as
StateSaveLocation, so for slurmctld you typically only need to save the
configuration files and that directory. (Note this does not include
munge or the slurmdbd.) A backup sketch is in the P.S. below.

Regards,
Pär Lindfors, NSC
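
P.S. A few quick sketches to go with the above. All hostnames and paths
below are made up, so adjust them for your site.

First, running slurmctld and slurmdbd on the same machine is mostly a
matter of pointing both at the same host in the configs. A minimal
sketch, assuming a host named "slurm-master" and an older-style
slurm.conf (recent releases use SlurmctldHost instead of
ControlMachine):

    # slurm.conf (fragment)
    ControlMachine=slurm-master              # slurmctld runs here
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=slurm-master       # slurmdbd on the same host

    # slurmdbd.conf (fragment)
    DbdHost=slurm-master                     # where slurmdbd listens
    StorageType=accounting_storage/mysql     # backing database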
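
Second, one simple way to compare compile-time features between the
CentOS and Ubuntu builds is to capture the ./configure output from each
build tree and diff it. The --prefix and log file names here are just
examples:

    # On each build host, from the Slurm source tree:
    ./configure --prefix=/opt/slurm 2>&1 | tee configure-$(hostname -s).log

    # Copy the logs to one place, then:
    diff configure-centos7.log configure-ubuntu1604.log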
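
Third, a minimal backup sketch for the controller, saving the config
files and the StateSaveLocation directory. The directories below are
common defaults, so check what your own config actually uses:

    # Find out where slurmctld keeps its state:
    scontrol show config | grep -i StateSaveLocation

    # Save the Slurm config files and the state directory.
    # Remember the munge key and the slurmdbd database are
    # not covered by this and need separate backups.
    tar czf slurmctld-backup.tar.gz /etc/slurm /var/spool/slurmctld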