Yes, dynamic DNS.

On Tue, Oct 25, 2022 at 2:17 PM Meaden, Xand <xand.mea...@kcl.ac.uk> wrote:
> The nodes are being removed as they aren't resolving in DNS anymore; are
> you using a dynamic system where only active hosts' names resolve?
>
> Xand
>
> ------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Joe Teumer <joe.teu...@gmail.com>
> *Sent:* Tuesday, October 25, 2022 7:42:16 PM
> *To:* slurm-us...@schedmd.com <slurm-us...@schedmd.com>
> *Subject:* [slurm-users] slurmctld removing offline nodes
>
> We noticed that the slurm controller will remove nodes that it cannot
> reach.
> How can this be disabled?
> We would like to see the nodes marked down/drain instead of the controller
> removing the nodes from sinfo.
>
> /var/log/slurm/slurmctld.log
> [2022-10-25T13:10:01.500] debug: Log file re-opened
> [2022-10-25T13:10:01.589] error: get_addr_info: getaddrinfo() failed:
> Temporary failure in name resolution
> [2022-10-25T13:10:01.589] error: slurm_set_addr: Unable to resolve
> "spg-ethx-f4ce"
> [2022-10-25T13:10:01.589] error: slurm_get_port: Address family '0' not
> supported
> [2022-10-25T13:10:01.589] error: _set_slurmd_addr: failure on spg-ethx-f4ce
>
> cat /etc/slurm/slurm.conf | grep -i f4ce
> NodeName=spg-ethx-f4ce ...
> PartitionName=debug spg-ethx-f4ce ...
>
> No output in sinfo:
> sinfo -N | grep f4ce
> sinfo -R | grep f4ce
>
> slurmd -V
> slurm 21.08.0
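
The log excerpt quoted above shows getaddrinfo() failing for the node name, which matches the dynamic DNS explanation. As a rough way to confirm which node names fail to resolve from the controller host, a small check along these lines might help. This is only a sketch and not part of Slurm: the function name unresolved_nodes and its resolver parameter are made up for illustration, and the node list has to be supplied by hand.

```python
import socket


def unresolved_nodes(node_names, resolver=socket.getaddrinfo):
    """Return the names from node_names that fail name resolution.

    resolver is a hypothetical hook (defaulting to socket.getaddrinfo)
    so the lookup can be swapped out when testing without a network.
    """
    missing = []
    for name in node_names:
        try:
            # Port is None because only name resolution matters here.
            resolver(name, None)
        except socket.gaierror:
            # Same class of failure that slurmctld logs as
            # "getaddrinfo() failed" for an unresolvable node.
            missing.append(name)
    return missing
```

Calling unresolved_nodes(["spg-ethx-f4ce"]) on the controller host would return that name in the list whenever its DNS record is absent, and an empty list when the name resolves.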