Hi all,

We are building out a new Slurm cluster for a research group here; 
unfortunately this has taken place over a long period of time, and some 
architectural changes were made in the middle, most importantly to the host OS 
on the Slurm nodes (we were using CentOS 7.x originally; the compute nodes are 
now on Ubuntu 16.04). Currently, we have a single controller node (slurmctld), 
an accounting DB node (slurmdbd), and 10 compute/worker nodes (slurmd).

The problem is that the controller is still running CentOS 7 with our older 
NFS-mounted /home scheme, but the compute nodes are now all Ubuntu 16.04 with 
local /home filesystems. Currently (still in testing mode here), the users log 
into the controller node to submit jobs, but of course that is now a completely 
different OS/library environment from the one on the compute nodes. (They 
cannot log into the compute nodes unless they have a job currently running 
there, as we have implemented the 'pam_slurm.so' PAM module on the compute 
nodes.)
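
For reference, what we did on the compute nodes is roughly the following (a 
sketch; the exact PAM file can differ by distro, and /etc/pam.d/sshd is what 
we edited on Ubuntu 16.04):

    # /etc/pam.d/sshd on each compute node
    # Deny interactive SSH logins unless the user has a job running here.
    account    required     pam_slurm.so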

My questions are these:

1) Can we leave the current controller machine on CentOS 7 and just have the 
users log into other machines (with the same config as the compute nodes) to 
submit jobs? Or should the controller node really be on the same OS as the 
compute nodes for some reason?
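
(For what it's worth, my current understanding is that a submit host only 
needs the Slurm client commands, a copy of slurm.conf, and the shared munge 
key; it doesn't have to be defined as a compute node. A rough sketch of what 
I'd run on one of the Ubuntu machines, with the package names, paths, and the 
'controller' hostname being my assumptions for 16.04:)

    # On a prospective Ubuntu 16.04 submit/login node (sketch)
    apt-get install slurm-client munge
    # Copy the cluster's munge key and slurm.conf from the controller
    scp controller:/etc/munge/munge.key /etc/munge/munge.key
    scp controller:/etc/slurm-llnl/slurm.conf /etc/slurm-llnl/slurm.conf
    systemctl restart munge
    srun hostname    # quick test submission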

2) Can I add a backup controller node that runs a different environment from 
the primary (e.g. the same environment as the compute nodes)? Or should (must) 
it be the same as the primary controller node?
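
(For context, here is roughly what I'd expect the relevant slurm.conf entries 
to look like, hostnames hypothetical; my understanding from the docs is that 
both controllers need to share the same StateSaveLocation, e.g. over NFS:)

    # slurm.conf excerpt (sketch; hostnames hypothetical)
    ControlMachine=ctl-primary
    BackupController=ctl-backup
    # Both controllers must read/write the same state directory
    StateSaveLocation=/nfs/slurm/state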

3) What are the steps to replace a primary controller, given that a backup 
controller exists? (Hopefully this is already documented somewhere that I just 
haven't found yet.)
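
(The only concrete piece I've found so far is the manual failover command; I 
believe the backup can be told to assume control with:)

    # Run on the backup controller to take over from the primary
    scontrol takeover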

Thanks,
Will
