On 7/23/21 12:29 PM, Riccardo Sucapane wrote:
I am using Slurm as a workload manager on a system
with a master and 3 nodes.
The operating system is the recent Rocky Linux 8.4,
and Slurm is version 20.11.8, taken from the EPEL
repository.
Everything works correctly: once the system is up, the command
"systemctl start slurmctld" works fine. At boot, however, the slurmctld
daemon does not start on the master machine and reports a series of errors.
Rather than quoting the whole slurmctld.log, here is the recurring error:

[2021-07-23T09:58:01.932] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-07-23T09:58:01.932] error: slurm_set_addr: Unable to resolve "blade01"
[2021-07-23T09:58:01.932] error: slurm_get_port: Address family '0' not supported
[2021-07-23T09:58:01.932] error: _set_slurmd_addr: failure on blade01

This seems to be a DNS name resolution error.

This could be due to slurmctld starting before the server's network is completely up. We have seen this with slurmd on EL 8.4 nodes, and I found a solution; see https://bugs.schedmd.com/show_bug.cgi?id=11878#c5. This will be fixed in Slurm 21.08.

In /usr/lib/systemd/system/slurmd.service and /usr/lib/systemd/system/slurmctld.service you should replace "network.target" with "network-online.target". Then reboot to test it.
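
As an alternative sketch (this is not from the bug report above): rather than editing the packaged unit files under /usr/lib/systemd/system, which a package update may overwrite, the same ordering can be added with a systemd drop-in. The file name override.conf is an assumption; any *.conf file in the service's .d directory is read. For example, in /etc/systemd/system/slurmctld.service.d/override.conf:

```ini
# Drop-in for slurmctld; the same fragment should also apply to slurmd and slurmdbd.
# Order the daemon after the network is reported online, and pull that target in.
[Unit]
After=network-online.target
Wants=network-online.target
```

Run "systemctl daemon-reload" afterwards (or create the file with "systemctl edit slurmctld"), then reboot to test. Note that network-online.target only delays startup if a wait-online service is enabled, e.g. NetworkManager-wait-online.service; I am assuming here that NetworkManager manages the network on these hosts.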

In this case I have, for simplicity, set
"AccountingStorageType=accounting_storage/none" in the slurm.conf file.
Using the slurmdbd/mariadb support also works without problems, but
slurmctld still does not start on boot.
Also, blade01 in the log above is the hostname of one of the nodes.

You should probably fix /usr/lib/systemd/system/slurmdbd.service as well.

/Ole
