You'll definitely need to get slurmd and slurmctld working before proceeding 
further. slurmctld is the Slurm controller that the srun error message says it 
cannot contact.

Though there are probably other steps you could take to make the slurmd and 
slurmctld system services available, it might be simpler to run the rpmbuild and 
rpm commands listed on https://slurm.schedmd.com/quickstart_admin.html , right 
below the instructions you were following. Between them, those two commands 
cover steps 3-8 of your original procedure, and will almost certainly put the 
systemd service files in the correct location.
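
As a rough sketch of that rpmbuild/rpm route (not a tested recipe): the 
tarball name is the one from your message, the output directory is assumed to 
be rpmbuild's usual default under $HOME/rpmbuild/RPMS/<arch>, and the helper 
names run and build_and_install_slurm are made up for illustration. It only 
prints the commands unless DRY_RUN=0 is set.

```shell
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 to really run them (needs the rpm-build tools, and root for the install).
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper wrapping the two commands from the quick start guide.
build_and_install_slurm() {
    tarball=$1
    # Assumption: rpmbuild writes its packages below $HOME/rpmbuild/RPMS/<arch>.
    rpm_dir=${2:-"$HOME/rpmbuild/RPMS/$(uname -m)"}
    run rpmbuild -ta "$tarball"
    run rpm --install "$rpm_dir"/slurm-*.rpm
}

build_and_install_slurm slurm-20.11.5.tar.bz2
```

Afterwards, "systemctl status slurmctld slurmd" should at least find the units, 
if the packages installed the service files as expected.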

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Johnsy 
K. John <johnsyjo...@gmail.com>
Date: Monday, April 19, 2021 at 7:18 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>, 
fzill...@lenovo.com <fzill...@lenovo.com>, johnsy john <johnsyjo...@gmail.com>
Subject: Re: [slurm-users] [External] Slurm Configuration assistance: Unable to 
use srun after installation (slurm on fedora 33)

________________________________
Hi Florian,
Thanks for the valuable reply and help.

My answers follow each of your questions below (they were shown in green in the 
original message).

*  Do you have an active support contract with SchedMD? AFAIK they only offer 
paid support.
I don't have an active support contract. I just started learning Slurm by 
installing it on my Fedora machine. This is the first time I have installed and 
experimented with this kind of software.

*  The error message is pretty straightforward, slurmctld is not running. Did 
you start it (systemctl start slurmctld)?
I did: systemctl start slurmctld and got this message: Failed to start 
slurmctld.service: Unit slurmctld.service not found.

*  slurmd needs to run on the node(s) you want to run on as well, and as I'm 
guessing you are using localhost for the controller and want to run jobs on 
localhost, so slurmctld and slurmd need to be running on localhost.
systemctl start slurmd
Failed to start slurmd.service: Unit slurmd.service not found.
Same result as for slurmctld.

*  Is munge running?
Yes. Here is the status:
[johnsy@homepc ~]$ systemctl status munge
munge.service - MUNGE authentication service
     Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor 
preset: disabled)
     Active: active (running) since Mon 2021-04-19 07:49:13 EDT; 13min ago <-- 
it is always enabled after restart. This log is just after a restart.
       Docs: man:munged(8)
    Process: 1070 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
   Main PID: 1072 (munged)
      Tasks: 4 (limit: 76969)
     Memory: 1.4M
        CPU: 8ms
     CGroup: /system.slice/munge.service
             └─1072 /usr/sbin/munged

*  May I ask why you're chown-ing pid and logfiles? The slurm user (typically 
"slurm") needs to have access to those files. Munge for instance checks for 
ownership and complains if something is not correct.
I tried to follow some instructions mentioned in: 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#copy-slurm-conf-to-all-nodes
I thought, as I am installing the slurm as root, the user "johnsy" has to have 
ownership permissions.

*  "srun /proc/cpuinfo" will fail, even if slurmctld and slurmd are running, 
because /proc/cpuinfo is not an executable file. You may want to insert "cat" 
after srun. Another simple test would be "srun hostname"
I tried "srun hostname" and got the following error message:
srun: error: Unable to allocate resources: Unable to contact slurm controller 
(connect failure)

Also tried:
systemctl status slurmctld
Unit slurmctld.service could not be found.

Also I tried installing the packaged version: 
https://src.fedoraproject.org/rpms/slurm
 using dnf.
The same problem exists.
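
Since systemctl reports the units as not found, a first check is whether the 
Slurm programs and unit files are installed at all. The sketch below is only an 
illustration: the helper name locate_programs is hypothetical, and the 'slurm*' 
pattern assumes the unit files keep their usual slurmctld.service and 
slurmd.service names.

```shell
# Hypothetical helper: report where each named program lives, or that it cannot be found.
locate_programs() {
    for prog in "$@"; do
        if path=$(command -v "$prog" 2>/dev/null); then
            printf '%s: %s\n' "$prog" "$path"
        else
            printf '%s: not found on PATH\n' "$prog"
        fi
    done
}

# Daemons and client command discussed in this thread.
locate_programs slurmctld slurmd munged srun

# List any Slurm unit files systemd knows about; no matches means none are installed.
if command -v systemctl >/dev/null 2>&1; then
    systemctl --no-pager list-unit-files 'slurm*' || true
fi
```

If the daemons are present but no unit files are listed, one way to test 
without systemd is to start them in the foreground as root with "slurmctld -D" 
and "slurmd -D", each in its own terminal.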

Any help in this regard will be appreciated.

Thanks a lot.
Johnsy


On Mon, Apr 19, 2021 at 5:04 AM Florian Zillner <fzill...@lenovo.com> wrote:
Hi Johnsy,

  1.  Do you have an active support contract with SchedMD? AFAIK they only 
offer paid support.
  2.  The error message is pretty straightforward, slurmctld is not running. 
Did you start it (systemctl start slurmctld)?
  3.  slurmd needs to run on the node(s) you want to run on as well, and as I'm 
guessing you are using localhost for the controller and want to run jobs on 
localhost, so slurmctld and slurmd need to be running on localhost.
  4.  Is munge running?
  5.  May I ask why you're chown-ing pid and logfiles? The slurm user 
(typically "slurm") needs to have access to those files. Munge for instance 
checks for ownership and complains if something is not correct.
  6.  "srun /proc/cpuinfo" will fail, even if slurmctld and slurmd are running, 
because /proc/cpuinfo is not an executable file. You may want to insert "cat" 
after srun. Another simple test would be "srun hostname"
And, just my personal opinion, if this is your first experiment with Slurm, I 
wouldn't change too much right from the beginning but instead get it working 
first and then change things to your needs. Slurm is also available in the EPEL 
repos, so you could install it using dnf and experiment with the packaged 
version.
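
To illustrate point 6, here is a small sketch of why "srun /proc/cpuinfo" 
cannot work while "srun cat /proc/cpuinfo" or "srun hostname" can: srun needs 
something it can execute. The helper name is_launchable is hypothetical and 
only approximates the check locally; it does not talk to Slurm at all.

```shell
# Hypothetical helper: return 0 if "$1" could be started as a program, non-zero otherwise.
is_launchable() {
    case $1 in
        */*) [ -f "$1" ] && [ -x "$1" ] ;;        # explicit path: must be an executable file
        *)   command -v "$1" >/dev/null 2>&1 ;;   # bare name: must resolve as a command in this shell
    esac
}

for target in /proc/cpuinfo cat hostname; do
    if is_launchable "$target"; then
        echo "$target: can be launched, e.g. srun $target"
    else
        echo "$target: cannot be launched directly"
    fi
done
```

On a typical Linux system /proc/cpuinfo is readable but has no execute bit, so 
it has to be passed as an argument to a program such as cat.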

Hope this helps,
Florian


________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Johnsy 
K. John <johnsyjo...@gmail.com>
Sent: Monday, 19 April 2021 01:43
To: sa...@schedmd.com <sa...@schedmd.com>; johnsy john 
<johnsyjo...@gmail.com>; slurm-us...@schedmd.com <slurm-us...@schedmd.com>
Subject: [External] [slurm-users] Slurm Configuration assistance: Unable to use 
srun after installation (slurm on fedora 33)

Hello SchedMD team,

I would like to use your slurm workload manager for learning purposes.
And I tried installing the software (downloaded from: 
https://www.schedmd.com/downloads.php ) and followed the steps as mentioned in:

https://slurm.schedmd.com/download.html
https://slurm.schedmd.com/quickstart_admin.html

My Linux OS is Fedora 33, and I tried installing it while logged in as root.
After installation and configuration as described on the page 
https://slurm.schedmd.com/quickstart_admin.html
I got some errors when I tried to do srun.
Details about the installation and use are as follows:

Using root permissions, copied to: /root/installations/

cd /root/installations/

tar --bzip -x -f slurm-20.11.5.tar.bz2

cd slurm-20.11.5/

./configure --enable-debug --prefix=/usr/local --sysconfdir=/usr/local/etc

make
make install

The following steps are based on 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration
mkdir /var/spool/slurmctld /var/log/slurm
chown johnsy /var/spool/slurmctld
chown johnsy /var/log/slurm
chmod 755 /var/spool/slurmctld /var/log/slurm

 cp /var/run/slurmctld.pid /var/run/slurmd.pid

touch /var/log/slurm/slurmctld.log
chown johnsy /var/log/slurm/slurmctld.log

touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
chown johnsy /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log

ldconfig -n /usr/lib64

Now when I tried an example command for trial:

srun /proc/cpuinfo

I get the following error:

srun: error: Unable to allocate resources: Unable to contact slurm controller 
(connect failure)


The slurm.conf configuration file that I created is:
######################################################################################################
######################################################################################################
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=homepc
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=johnsy
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
#SlurmctldLogFile=
SlurmdDebug=info
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=localhost CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
######################################################################################################
######################################################################################################
######################################################################################################
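
Regarding the ownership question raised earlier in the thread, a minimal sketch 
for comparing the SlurmUser setting with the owners of the paths named in a 
slurm.conf like the one above. Assumptions: the file sits at 
/usr/local/etc/slurm.conf (from the --sysconfdir used at configure time), keys 
are written exactly as in the listing, and GNU stat is available. The names 
conf_value, report_ownership and CONF_FILE are hypothetical.

```shell
# Hypothetical helper: print the value of an active (uncommented) key in a slurm.conf-style file.
# $1 = key name, $2 = path to the file. The last matching line wins.
conf_value() {
    sed -n "s/^[[:space:]]*$1=\([^[:space:]#]*\).*/\1/p" "$2" | tail -n 1
}

# Hypothetical helper: show SlurmUser and the current owner of each configured path.
report_ownership() {
    conf=$1
    user=$(conf_value SlurmUser "$conf")
    echo "SlurmUser: ${user:-<unset>}"
    for key in StateSaveLocation SlurmdSpoolDir SlurmctldPidFile SlurmdPidFile; do
        path=$(conf_value "$key" "$conf")
        [ -n "$path" ] || continue
        if [ -e "$path" ]; then
            echo "$key=$path owner=$(stat -c %U "$path")"   # GNU stat syntax
        else
            echo "$key=$path (missing)"
        fi
    done
}

# Assumed location, based on --sysconfdir=/usr/local/etc in the configure step.
CONF_FILE=${CONF_FILE:-/usr/local/etc/slurm.conf}
if [ -f "$CONF_FILE" ]; then
    report_ownership "$CONF_FILE"
else
    echo "no slurm.conf found at $CONF_FILE"
fi
```

The slurmctld paths are the ones that matter for SlurmUser; slurmd normally runs 
as root, as the commented "#SlurmdUser=root" line suggests.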
