Hi,
I don't know whether this is the problem, but I think setting
"ControlMachine=localhost" and not giving the Slurm master node a real
hostname are not good decisions. How are the compute nodes supposed to work
out the IP address of the Slurm master node from "localhost"? I also suggest
avoiding capital letters in anything related to Slurm (hostnames, node
names, and so on).
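For example, something along these lines is what I mean; the name
"slurm-master" and all of the addresses below are placeholders, not values
from your cluster:

# /etc/hosts, identical on the controller and on every compute node
192.168.1.10 slurm-master
192.168.1.11 firenode1
192.168.1.12 firenode2
192.168.1.13 firenode3

# slurm.conf, the same copy on every node
# (SlurmctldHost= is the current name for the older ControlMachine= setting)
SlurmctldHost=slurm-master
NodeName=firenode[1-3] State=UNKNOWN

Then each compute node can check that the controller name resolves with
"getent hosts slurm-master" before slurmd is started.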
Ahmet M.
On 15.12.2020 21:15, Avery Grieve wrote:
I changed my .service file to write to a log. The slurmd daemons are
running (started manually) on the compute nodes. With the service enabled,
I get the following on startup:
[2020-12-15T18:09:06.412] slurmctld version 20.11.1 started on cluster cluster
[2020-12-15T18:09:06.539] No memory enforcing mechanism configured.
[2020-12-15T18:09:06.572] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode1"
[2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
[2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode1
[2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode2"
[2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
[2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode2
[2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode3"
[2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
[2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode3
[2020-12-15T18:09:06.578] Recovered state of 3 nodes
[2020-12-15T18:09:06.579] Recovered information about 0 jobs
[2020-12-15T18:09:06.582] Recovered state of 0 reservations
[2020-12-15T18:09:06.582] read_slurm_conf: backup_controller not specified
[2020-12-15T18:09:06.583] Running as primary controller
[2020-12-15T18:09:06.592] No parameter for mcs plugin, default values set
[2020-12-15T18:09:06.592] mcs: MCSParameters = (null). ondemand set.
[2020-12-15T18:09:06.595] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-15T18:09:06.595] error: slurm_set_addr: Unable to resolve "(null)"
[2020-12-15T18:09:06.595] error: slurm_set_port: attempting to set port without address family
[2020-12-15T18:09:06.603] error: Error creating slurm stream socket: Address family not supported by protocol
[2020-12-15T18:09:06.603] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
The main errors seem to be failures to resolve the host names and, as a
consequence, failure to set the port. My /etc/hosts file defines the IPs
for FireNode[1-3] and does not contain any IPv6 entries. My service file
also includes an "After=network-online.target" clause.
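To be concrete, the kind of override I mean is roughly the following; the
drop-in path is assumed from a typical install, and I would have to
double-check whether my current file also has the Wants= line:

# /etc/systemd/system/slurmctld.service.d/override.conf
[Unit]
# After= only orders the units; network-online.target is not pulled into
# the boot transaction unless some unit also Wants= (or Requires=) it.
Wants=network-online.target
After=network-online.target

(Any change here needs a "systemctl daemon-reload" before it takes effect.)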
If I then start the daemon by hand with "systemctl start slurmctld", I end
up with the following log:
[2020-12-15T18:14:03.448] slurmctld version 20.11.1 started on cluster cluster
[2020-12-15T18:14:03.456] No memory enforcing mechanism configured.
[2020-12-15T18:14:03.465] Recovered state of 3 nodes
[2020-12-15T18:14:03.465] Recovered information about 0 jobs
[2020-12-15T18:14:03.465] Recovered state of 0 reservations
[2020-12-15T18:14:03.466] read_slurm_conf: backup_controller not specified
[2020-12-15T18:14:03.466] Running as primary controller
[2020-12-15T18:14:03.466] No parameter for mcs plugin, default values set
[2020-12-15T18:14:03.466] mcs: MCSParameters = (null). ondemand set.
As you can see, it starts up fine. It seems like something goes wrong
during the initial network stack configuration at boot. I'm not really
sure where to begin troubleshooting this, and a bit of googling hasn't
revealed much either, unfortunately.
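Is the boot ordering the right place to look? The only concrete check I can
think of is something like the following, assuming systemd-resolved is what
handles name resolution here (which I haven't verified):

systemd-analyze critical-chain slurmctld.service
journalctl -b -u slurmctld -u systemd-resolved

i.e. comparing when slurmctld starts against when name resolution actually
becomes available.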
Any advice?
~Avery Grieve
They/Them/Theirs please!
University of Michigan
On Tue, Dec 15, 2020 at 11:53 AM Avery Grieve <agri...@umich.edu> wrote:
Maybe a silly question, but where do you find the daemon logs or
specify their location?
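Is it the SlurmctldLogFile and SlurmdLogFile parameters in slurm.conf, e.g.
something like the lines below (paths are just an example, not what I have
set), or do the messages only end up in the journal (journalctl -u
slurmctld) when those are unset?

SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log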
~Avery Grieve
They/Them/Theirs please!
University of Michigan
On Mon, Dec 14, 2020 at 7:22 PM Alpha Experiment <projectalpha...@gmail.com> wrote:
Hi,
I am trying to run Slurm on Fedora 33. On boot, the slurmd daemon comes up
correctly; however, the slurmctld daemon always fails.
[admin@localhost ~]$ systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
Main PID: 2363 (slurmd)
Tasks: 2
Memory: 3.4M
CPU: 211ms
CGroup: /system.slice/slurmd.service
└─2363 /usr/local/sbin/slurmd -D
Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.
[admin@localhost ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago
Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 1972 (code=exited, status=1/FAILURE)
CPU: 21ms
Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.
The slurmctld log is as follows:
[2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
[2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
[2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
[2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
[2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
[2020-12-14T16:02:12.772] Recovered state of 1 nodes
[2020-12-14T16:02:12.772] Recovered information about 0 jobs
[2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2020-12-14T16:02:12.779] Recovered state of 0 reservations
[2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
[2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2020-12-14T16:02:12.779] Running as primary controller
[2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
[2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
[2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
[2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
[2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
[2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
Strangely, the daemon works fine when it is restarted by hand. After running
systemctl restart slurmctld.service
the service status is:
[admin@localhost ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
Main PID: 2815 (slurmctld)
Tasks: 7
Memory: 1.9M
CPU: 15ms
CGroup: /system.slice/slurmctld.service
└─2815 /usr/local/sbin/slurmctld -D
Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.
Could anyone point me towards how to fix this? I expect it's
just an issue with my configuration file, which I've copied
below for reference.
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=localhost
ControlMachine=localhost
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/home/slurm/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/home/slurm/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd/
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/home/slurm/spool/
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=info
SlurmctldLogFile=/home/slurm/log/slurmctld.log
#SlurmdDebug=info
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
Thanks!
-John