Apologies in advance, since I don't have a complete answer. I'm not sure why that doesn't work, but my understanding is that for failover you use a separate "SlurmctldHost" line for each of the controllers, e.g.:

SlurmctldHost=host1
SlurmctldHost=host2
...

The list format seems to be used in some other scenario I don't completely understand. We're using the multiple lines for our HA arrangement and it seems to be working OK.

- Michael

On Wed, Dec 13, 2023 at 12:18 PM Jackson, Gary L. <gary.jack...@jhuapl.edu> wrote:

> The SlurmctldHost value is set like the following in my slurm.conf:
>
> SlurmctldHost=host0,host1
>
> That seems to be legal according to the documentation. However, I get
> error messages like the following:
>
> $ srun id
> srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
> srun: error: slurm_set_addr: Unable to resolve "host0,host1"
> srun: error: Unable to establish control machine address
> srun: error: Unable to allocate resources: Address already in use
>
> If I try to put IP addresses in parentheses per the documentation, I get
> different errors:
>
> $ srun id
> srun: error: Bad value "host0(12.34.56.78),host1" for SlurmctldHost
> srun: error: No SlurmctldHost defined.
> srun: fatal: Unable to process configuration file
>
> If I put a single hostname, or a hostname with an address in parentheses
> as the value for SlurmctldHost, it works fine but I have no failover.
>
> I’m running 23.02.6:
>
> $ sinfo --version
> slurm 23.02.6
>
> What’s going on?
>
> --
> Gary
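
For anyone who wants to sanity-check a config for this, here is a rough Python sketch. It is not part of Slurm; the function names `slurmctld_hosts` and `check_hosts` are made up for illustration, and it assumes each SlurmctldHost entry sits on its own line. It only flags the two situations described in this thread: a comma-separated value, and fewer than two SlurmctldHost lines.

```python
import re

# Hypothetical helper, not part of Slurm: matches "SlurmctldHost=<value>" at line start.
HOST_RE = re.compile(r"^\s*SlurmctldHost\s*=\s*(\S+)", re.IGNORECASE)


def slurmctld_hosts(conf_text):
    """Return the raw SlurmctldHost values found in slurm.conf text, in order."""
    hosts = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop anything after a comment marker
        match = HOST_RE.match(line)
        if match:
            hosts.append(match.group(1))
    return hosts


def check_hosts(hosts):
    """Return a list of human-readable warnings for the given values."""
    problems = []
    for value in hosts:
        if "," in value:
            # This is the form that failed to resolve in the report above.
            problems.append(
                f"{value!r}: comma-separated list; try one SlurmctldHost line per controller"
            )
    if len(hosts) < 2:
        problems.append("fewer than two SlurmctldHost lines, so no backup controller")
    return problems


if __name__ == "__main__":
    sample = "SlurmctldHost=host0,host1\n"
    for warning in check_hosts(slurmctld_hosts(sample)):
        print(warning)
```

To check a real file, read it into a string and pass it to `slurmctld_hosts`. The path to slurm.conf varies by site, so it is left out here.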