Package: slurmctld
Version: 20.11.4-1
Severity: normal
I have a Slurm cluster set up on a single node. This node runs slurmctld, munge, and slurmd. When I reboot the node, there appears to be a race condition: slurmctld and/or slurmd try to start before networking is fully available, and both fail with address-resolution errors. By the time I can ssh into the machine, restarting slurmctld and slurmd by hand works fine. I replaced "localhost" with "127.0.0.1", but that does not change anything.

slurmctld.log has:

[2021-03-10T07:13:08.118] slurmctld version 20.11.4 started on cluster cluster
[2021-03-10T07:13:08.132] No memory enforcing mechanism configured.
[2021-03-10T07:13:08.137] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.137] error: slurm_set_addr: Unable to resolve "127.0.0.1"
[2021-03-10T07:13:08.137] error: slurm_get_port: Address family '0' not supported
[2021-03-10T07:13:08.137] error: _set_slurmd_addr: failure on 127.0.0.1
[2021-03-10T07:13:08.137] Recovered state of 1 nodes
[2021-03-10T07:13:08.138] Recovered JobId=1651 Assoc=0
[2021-03-10T07:13:08.138] Recovered information about 1 jobs
[2021-03-10T07:13:08.138] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2021-03-10T07:13:08.140] Recovered state of 0 reservations
[2021-03-10T07:13:08.140] read_slurm_conf: backup_controller not specified
[2021-03-10T07:13:08.140] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2021-03-10T07:13:08.140] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2021-03-10T07:13:08.141] Running as primary controller
[2021-03-10T07:13:08.141] No parameter for mcs plugin, default values set
[2021-03-10T07:13:08.141] mcs: MCSParameters = (null). ondemand set.
[2021-03-10T07:13:08.142] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.142] error: slurm_set_addr: Unable to resolve "(null)"
[2021-03-10T07:13:08.142] error: slurm_set_port: attempting to set port without address family
[2021-03-10T07:13:08.144] error: Error creating slurm stream socket: Address family not supported by protocol
[2021-03-10T07:13:08.144] fatal: slurm_init_msg_engine_port error Address family not supported by protocol

slurmd.log has:

[2021-03-10T07:13:08.195] cgroup namespace 'freezer' is now mounted
[2021-03-10T07:13:08.198] slurmd version 20.11.4 started
[2021-03-10T07:13:08.199] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.199] error: slurm_set_addr: Unable to resolve "(null)"
[2021-03-10T07:13:08.199] error: slurm_set_port: attempting to set port without address family
[2021-03-10T07:13:08.200] error: Error creating slurm stream socket: Address family not supported by protocol
[2021-03-10T07:13:08.200] error: Unable to bind listen port (6818): Address family not supported by protocol

-- System Information:
Debian Release: bullseye/sid
  APT prefers unstable-debug
  APT policy: (500, 'unstable-debug'), (500, 'testing-security'), (500, 'testing-proposed-updates-debug'), (500, 'testing-debug'), (500, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-3-amd64 (SMP w/8 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8), LANGUAGE=en_CA:en
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages slurmctld depends on:
ii  libc6                    2.31-9
ii  lsb-base                 11.1.0
pn  munge                    <none>
pn  slurm-client             <none>
pn  slurm-wlm-basic-plugins  <none>
ii  ucf                      3.0043

slurmctld recommends no packages.

slurmctld suggests no packages.
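The log pattern (getaddrinfo() failing even for the literal "127.0.0.1") suggests the daemons start before name-service lookups work. As a workaround until the packaged units order themselves after the network, both services can be made to wait with a systemd drop-in. A minimal sketch, assuming the stock unit names slurmctld.service and slurmd.service and that a wait-online service (systemd-networkd-wait-online or NetworkManager-wait-online) is enabled:

  # Run once per unit; systemctl opens an editor for a drop-in
  # override under /etc/systemd/system/<unit>.d/override.conf:
  #   systemctl edit slurmctld.service
  #   systemctl edit slurmd.service
  # Contents of the override:

  [Unit]
  # network-online.target is only reached once the network is
  # actually configured, unlike the weaker network.target.
  Wants=network-online.target
  After=network-online.target

systemd reloads the units automatically after the edit; on the next boot both daemons should wait until addresses can be resolved, which matches the observation that a manual restart after boot always succeeds.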
slurm.conf:

SlurmctldHost=simplex(127.0.0.1)
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=simplex NodeAddr=127.0.0.1 CPUs=80 RealMemory=385570 CoresPerSocket=20 ThreadsPerCore=2 State=UNKNOWN
PartitionName=login Nodes=simplex Default=YES MaxTime=8:00:00 DefMemPerCPU=1024 MaxMemPerCPU=2048 State=UP
PartitionName=long Nodes=simplex Default=NO MaxTime=120:00:00 DefMemPerCPU=2048 MaxMemPerCPU=4096 MaxCPUsPerNode=40 State=UP
PartitionName=big Nodes=simplex Default=NO MaxTime=24:00:00 MaxCPUsPerNode=80 DefMemPerCPU=4096 State=UP
PartitionName=cron Nodes=simplex Default=NO MaxTime=2:00:00 MaxCPUsPerNode=2 MaxMemPerCPU=1024 State=UP
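Independent of the unit ordering, resolution of the node name can be kept out of DNS entirely so that a files-only lookup succeeds. A sketch of /etc/hosts entries, assuming the hostname "simplex" from the configuration above; note this may not help if getaddrinfo() itself is unavailable that early in boot, as the failure to resolve even the literal "127.0.0.1" hints:

  127.0.0.1   localhost
  127.0.1.1   simplex

  # Verify a files-based lookup without touching DNS:
  #   getent hosts simplex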