Will, I know I will regret chiming in here. Are you able to say what cluster manager or framework you are using? I don't see a problem in running two different distributions, but as Pär says, look at your development environment.
For my part, I would ask: have you thought about containerisation? i.e. CentOS compute nodes, and run Singularity? (There is a rough sketch of what that could look like at the bottom of this mail.)

Also, the 'unique home directory per node' gives me the heebie-jeebies. I guess technically it is OK. However, many commercial packages create dot files or dot directories in user home directories. I am thinking of things like Ansys and Matlab etc. etc. etc. here. What will you do if these dotfiles are not consistent across the cluster?

Before anyone says it, I was arguing somewhere else recently that 'home directories' are an outdated concept when you are running HPC. I still think that, and this is a classic case in point. Forgive me if I have misunderstood your setup.

On 25 May 2018 at 11:30, Pär Lindfors <pa...@nsc.liu.se> wrote:
> Hi Will,
>
> On 05/24/2018 05:43 PM, Will Dennis wrote:
> > (we were using CentOS 7.x
> > originally, now the compute nodes are on Ubuntu 16.04.) Currently, we
> > have a single controller (slurmctld) node, an accounting db node
> > (slurmdbd), and 10 compute/worker nodes (slurmd.)
>
> Time to start upgrading to Ubuntu 18.04 now then? :-)
>
> For a 10 node cluster it might make more sense to run slurmctld and
> slurmdbd on the same hardware as neither have very high hardware
> requirements.
>
> On our current clusters we run both services on the same machine. The
> main disadvantage of this is that it makes upgrades inconvenient as it
> prevents upgrading slurmdbd and slurmctld independently. For future
> installations we will probably try running slurmdbd in a VM.
>
> > The problem is that the controller is still running CentOS 7 with our
> > older NFS-mounted /home scheme, but the compute nodes are now all Ubuntu
> > 16.04 with local /home fs’s.
>
> Does each user have a different local home directory on each compute
> node? That is not something I would recommend, unless you are very good
> at training your users to avoid submitting jobs in their home
> directories. I assume you have some other shared file system across the
> cluster?
>
> > 1) Can we leave the current controller machine on C7 OS, and just
> > have the users log into other machines (that have the same config as the
> > compute nodes) to submit jobs?
> > Or should the controller node really be
> > on the same OS as the compute nodes for some reason?
>
> I recommend separating them, for systems administration and user
> convenience reasons.
>
> With users logged into the same machine that is running your
> controller or other cluster services, the users can impact the operation
> of the entire cluster when they make mistakes. Typical user mistakes
> involve using all CPU resources, using all memory, filling up or
> overloading filesystems... Much better to have this happen on dedicated
> login machines.
>
> If the login machine uses a different OS than the worker nodes, users
> will also run into problems if they compile software there, as system
> library versions won't match what is available on the compute nodes.
>
> Technically, as long as you use the same Slurm version it should work.
> You should however check that your Slurm binaries on different OS are
> built with the exact same features enabled. Many are enabled at compile
> time, so check and compare the output from ./configure.
>
> > 2) Can I add a backup controller node that runs a different...
> 3) What are the steps to replace a primary controller, given that a
> ...
> We are not currently using a backup controller, so I can't answer that
> part.
>
> slurmctld keeps its state files in the directory configured as
> StateSaveLocation, so for slurmctld you typically only need to save the
> configuration files, and that directory. (Note this does not include
> munge or the slurmdbd.)
>
> Regards,
> Pär Lindfors, NSC
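As promised above, here is a rough sketch of what the Singularity idea could look like as a Slurm batch script. This is only an illustration: the image name (ubuntu-16.04.simg), the bind path (/scratch) and the program name are placeholders for whatever your site actually uses.

    #!/bin/bash
    #SBATCH --job-name=container-test
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00

    # Run the payload inside an Ubuntu 16.04 userland on a CentOS host,
    # binding a shared filesystem into the container.
    singularity exec --bind /scratch ubuntu-16.04.simg ./my_program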
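On Pär's point about checking that the Slurm binaries on the two distributions were built with the same features: a quick way is to capture the ./configure output from each build and diff it. The log file names below are just examples; re-run configure with the same options you used for the real builds.

    # In the Slurm source tree on each build host:
    ./configure 2>&1 | tee configure-$(hostname -s).log

    # config.log in each build tree also records the exact ./configure
    # command that was used. Then compare the two summaries:
    diff configure-centos7.log configure-ubuntu1604.log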
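And on that last point, about what to preserve from the controller: something like this is the sort of thing I would do before replacing it. The paths are only common defaults, so check your slurm.conf (or scontrol) for the actual locations, and remember, as Pär says, that this does not cover munge keys or the slurmdbd database.

    # Confirm where slurmctld keeps its state files:
    scontrol show config | grep -i StateSaveLocation

    # Archive the configuration directory and the state directory
    # (adjust /etc/slurm and /var/spool/slurmctld to your site):
    tar czf slurmctld-backup-$(date +%F).tar.gz /etc/slurm /var/spool/slurmctld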