Hi Luke and Avery,

I changed the After= line in the slurmctld.service file to:

After=network.target munge.service slurmd.service
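In case it helps anyone hitting the same thing: systemctl already shows a drop-in directory for the unit on my machine (/etc/systemd/system/slurmctld.service.d/override.conf), so the extra ordering can also live there instead of in the unit file itself. Roughly something like the sketch below should work; the drop-in file name is just whatever you already have, and After= entries from a drop-in get merged with the ones in the packaged unit:

# /etc/systemd/system/slurmctld.service.d/override.conf
[Unit]
# make sure the network, munge, and the local slurmd are up before slurmctld starts
After=network.target munge.service slurmd.service

followed by a 'systemctl daemon-reload' so systemd picks up the change.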
This seemed to do the trick!

Best,
John

On Mon, Dec 14, 2020 at 6:10 PM Avery Grieve <agri...@umich.edu> wrote:

> Hey Luke, I'm getting the same issues with my slurmctld daemon not
> starting on boot (as well as my slurmd daemon). Both fail with the same
> messages John got above (just the exit code).
>
> My slurmctld service file in /etc/systemd/system/ looks like this:
>
> [Unit]
> Description=Slurm controller daemon
> After=network.target munge.service
> ConditionPathExists=/etc/slurm-llnl/slurm.conf
>
> [Service]
> Type=simple
> EnvironmentFile=-/etc/default/slurmctld
> ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS
> ExecReload=/bin/kill -HUP $MAINPID
> LimitNOFILE=65536
>
> [Install]
> WantedBy=multi-user.target
>
> Similar to John, my daemon starts if I just run the systemctl start
> command following boot.
>
> ~Avery Grieve
> They/Them/Theirs please!
> University of Michigan
>
>
> On Mon, Dec 14, 2020 at 8:06 PM Luke Yeager <lyea...@nvidia.com> wrote:
>
>> What does your ‘slurmctld.service’ look like? You might want to add
>> something to the ‘After=’ section if your service is starting too quickly.
>>
>> e.g. we use ‘After=network.target munge.service’ (see here
>> <https://github.com/NVIDIA/nephele-packages/blob/30bc321c311398cc7a86485bc88930e4b6790fb4/slurm/debian/PACKAGE-control.slurmctld.service#L3>).
>>
>> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf Of* Alpha Experiment
>> *Sent:* Monday, December 14, 2020 4:20 PM
>> *To:* slurm-users@lists.schedmd.com
>> *Subject:* [slurm-users] slurmctld daemon error
>>
>> *External email: Use caution opening links or attachments*
>>
>> Hi,
>>
>> I am trying to run Slurm on Fedora 33. Upon boot the slurmd daemon is
>> running correctly; however, the slurmctld daemon always errors.
>>
>> [admin@localhost ~]$ systemctl status slurmd.service
>> ● slurmd.service - Slurm node daemon
>> Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>> Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
>> Main PID: 2363 (slurmd)
>> Tasks: 2
>> Memory: 3.4M
>> CPU: 211ms
>> CGroup: /system.slice/slurmd.service
>>         └─2363 /usr/local/sbin/slurmd -D
>>
>> Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.
>>
>> [admin@localhost ~]$ systemctl status slurmctld.service
>> ● slurmctld.service - Slurm controller daemon
>> Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>> Drop-In: /etc/systemd/system/slurmctld.service.d
>>          └─override.conf
>> Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago
>> Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
>> Main PID: 1972 (code=exited, status=1/FAILURE)
>> CPU: 21ms
>>
>> Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.
>> Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>> Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>>
>> The slurmctld log is as follows:
>>
>> [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
>> [2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
>> [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
>> [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
>> [2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
>> [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
>> [2020-12-14T16:02:12.772] Recovered state of 1 nodes
>> [2020-12-14T16:02:12.772] Recovered information about 0 jobs
>> [2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>> [2020-12-14T16:02:12.779] Recovered state of 0 reservations
>> [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
>> [2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
>> [2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>> [2020-12-14T16:02:12.779] Running as primary controller
>> [2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
>> [2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
>> [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
>> [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
>> [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
>> [2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
>> [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
>>
>> Strangely, the daemon works fine when it is restarted. After running
>>
>> systemctl restart slurmctld.service
>>
>> the service status is
>>
>> [admin@localhost ~]$ systemctl status slurmctld.service
>> ● slurmctld.service - Slurm controller daemon
>> Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>> Drop-In: /etc/systemd/system/slurmctld.service.d
>>          └─override.conf
>> Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
>> Main PID: 2815 (slurmctld)
>> Tasks: 7
>> Memory: 1.9M
>> CPU: 15ms
>> CGroup: /system.slice/slurmctld.service
>>         └─2815 /usr/local/sbin/slurmctld -D
>>
>> Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.
>>
>> Could anyone point me towards how to fix this? I expect it's just an
>> issue with my configuration file, which I've copied below for reference.
>>
>> # slurm.conf file generated by configurator easy.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> #SlurmctldHost=localhost
>> ControlMachine=localhost
>> #
>> #MailProg=/bin/mail
>> MpiDefault=none
>> #MpiParams=ports=#-#
>> ProctrackType=proctrack/cgroup
>> ReturnToService=1
>> SlurmctldPidFile=/home/slurm/run/slurmctld.pid
>> #SlurmctldPort=6817
>> SlurmdPidFile=/home/slurm/run/slurmd.pid
>> #SlurmdPort=6818
>> SlurmdSpoolDir=/var/spool/slurm/slurmd/
>> SlurmUser=slurm
>> #SlurmdUser=root
>> StateSaveLocation=/home/slurm/spool/
>> SwitchType=switch/none
>> TaskPlugin=task/affinity
>> #
>> #
>> # TIMERS
>> #KillWait=30
>> #MinJobAge=300
>> #SlurmctldTimeout=120
>> #SlurmdTimeout=300
>> #
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core
>> #
>> #
>> # LOGGING AND ACCOUNTING
>> AccountingStorageType=accounting_storage/none
>> ClusterName=cluster
>> #JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> #SlurmctldDebug=info
>> SlurmctldLogFile=/home/slurm/log/slurmctld.log
>> #SlurmdDebug=info
>> #SlurmdLogFile=
>> #
>> #
>> # COMPUTE NODES
>> NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
>> PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
>>
>> Thanks!
>> -John
>
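P.S. For anyone who wants to sanity-check this after a reboot, something like the commands below should show whether the ordering and the hostname resolution are actually in place. These are standard systemd/glibc tools, nothing Slurm-specific, and the output will of course differ per machine:

# the merged ordering systemd ends up using for the unit
systemctl show -p After slurmctld.service

# where slurmctld sat in the boot dependency chain
systemd-analyze critical-chain slurmctld.service

# confirm "localhost" resolves (this is what getaddrinfo() was failing on at boot)
getent hosts localhost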