Unfortunately that didn't work.
However, I modified my slurm.conf to lie and say I had 16 CPUs with 1 thread
each, and now everything is working fine.
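A rough sketch of what such a change could look like, for anyone trying the same workaround. The message does not give the exact values, so CoresPerSocket=16 and ThreadsPerCore=1 are assumptions, and `fake_single_thread` is a made-up helper name. It rewrites the CLOUD node definition from the slurm.conf quoted further down and prints the result instead of editing the file in place.

```shell
#!/bin/bash
# Sketch only: present each hardware thread to Slurm as its own core.
# ASSUMPTION: the values 16 and 1 are guesses; the message does not state them.
# Only lines that define CLOUD nodes are touched; everything else passes through.
fake_single_thread() {
    sed -e '/^NodeName=.*State=CLOUD/ s/CoresPerSocket=8/CoresPerSocket=16/' \
        -e '/^NodeName=.*State=CLOUD/ s/ThreadsPerCore=2/ThreadsPerCore=1/' "$@"
}

# Example: write a candidate file for review rather than overwriting the original.
#   fake_single_thread /etc/slurm/slurm.conf > /tmp/slurm.conf.new
```

Since the configuration uses FastSchedule=1, Slurm should schedule on the configured values rather than on what slurmd detects. The daemons will most likely need a restart after a node definition changes.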
One issue with CLOUD state machines is that when I run scontrol show
nodes they don't show up. Is there a way I can get their info when they
are not 'running'?
Thanks
Nathan
On 20/5/19 12:04 am, Riebs, Andy wrote:
Just looking at this quickly, have you tried specifying
“hint=multithread” as an sbatch parameter?
*From:*slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
Behalf Of *nathan norton
*Sent:* Saturday, May 18, 2019 6:03 PM
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] final stages of cloud infrastructure set up
Hi,
I am in the process of setting up Slurm using Amazon cloud
infrastructure. All is going well, I can elastically start and stop
nodes when jobs run. I am running into a few small teething issues
that are probably due to me not understanding some of the terminology
here. At a high level, all the nodes given to end users in the cloud
are hyper-threaded, so I want to use my nodes as hyper-threaded nodes.
All nodes are running the latest CentOS 7. I would also like each job to
run in a cgroup and not migrate around after it starts. As I said
before, I think most of it is working except for the few issues below.
My use case is this: I have an in-house binary application that is
single-threaded and does no message passing or anything like that.
The application is not memory bound; it is only compute bound.
So on a node I would like to be able to run 16 instances in parallel.
As can be seen below, if I launch the single app via srun it runs on
each thread of the CPU. However, if I run it via sbatch, as can
be seen, it only runs on CPUs 0-7 instead of CPUs 0-15.
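One way to check this from the job output is to group the reported CPU IDs by host. The helper below is a minimal sketch: `summarise_cpus` is a made-up name, and it assumes the output alternates a hostname line with a Cpus_allowed_list line, as in the sbatch listing further down.

```shell
#!/bin/bash
# Summarise which CPU IDs were reported on each host.
# Input: alternating lines, a hostname followed by "Cpus_allowed_list: <id>".
# Reads the files given as arguments, or stdin when none are given.
summarise_cpus() {
    awk '
        /^Cpus_allowed_list:/ { if (host != "") print host, $2; next }
        NF                    { host = $1 }   # any other non-empty line is a hostname
    ' "$@" |
    sort -k1,1 -k2,2n -u |                    # order by host, then CPU ID; drop repeats
    awk '
        $1 != prev { if (prev != "") print prev ": " ids; prev = $1; ids = $2; next }
                   { ids = ids "," $2 }
        END        { if (prev != "") print prev ": " ids }
    '
}

# Example: summarise_cpus slurm-106491_*
```

Run against the sbatch output below, every host should list only IDs 0 to 7, which matches the problem described above.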
Another question: what is the best way to retry failed jobs? I can
rerun the whole batch again, but I only want to rerun a single
step in the batch.
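One generic approach, not a Slurm feature, is to wrap just the fragile step in a small shell retry loop inside the batch script. The sketch below is only an illustration: `retry`, `RETRY_DELAY` and `./my_app` are hypothetical names, and the fixed delay between attempts is an arbitrary choice.

```shell
#!/bin/bash
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 or MAX_ATTEMPTS attempts have been made.
# Returns 0 on success, 1 if every attempt failed.
retry() {
    local max_attempts=$1
    shift
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-1}"   # hypothetical knob: seconds between attempts
    done
}

# Inside a batch script, wrap only the step that may fail, for example:
#   retry 3 srun -n1 ./my_app      # ./my_app is a placeholder
```

This reruns only the wrapped step and leaves the rest of the batch alone. Rerunning a whole job is a separate mechanism (for example `scontrol requeue`).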
Please see below for the output of various commands as well as my
slurm.conf file.
Many thanks
Nathan.
______________________________________________________________________
btuser@bt_slurm_login001[domain ]% slurmd -V
slurm 18.08.6-2
______________________________________________________________________
btuser@bt_slurm_login001[domain ]% cat nathan.batch.sh
#!/bin/bash
#SBATCH --job-name=nathan_test
#SBATCH --ntasks=1
#SBATCH --array=1-32
#SBATCH --ntasks-per-core=2
hostname
srun --hint=multithread -n1 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list
btuser@bt_slurm_login001[domain ]%
btuser@bt_slurm_login001[domain ]% sbatch nathan.batch.sh
Submitted batch job 106491
btuser@bt_slurm_login001[domain ]% cat slurm-106491_*
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 1
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 2
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 3
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 4
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 5
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 6
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 7
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 0
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 1
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 2
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 0
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 3
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 4
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 5
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 6
ip-10-0-8-90.ec2.internal
Cpus_allowed_list: 7
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 0
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 1
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 2
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 3
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 4
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 1
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 5
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 6
ip-10-0-8-91.ec2.internal
Cpus_allowed_list: 7
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 2
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 3
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 4
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 5
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 6
ip-10-0-8-88.ec2.internal
Cpus_allowed_list: 7
ip-10-0-8-89.ec2.internal
Cpus_allowed_list: 0
______________________________________________________________________
btuser@bt_slurm_login001[domain ]% srun -n32 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list
Cpus_allowed_list: 12
Cpus_allowed_list: 13
Cpus_allowed_list: 15
Cpus_allowed_list: 0
Cpus_allowed_list: 8
Cpus_allowed_list: 1
Cpus_allowed_list: 9
Cpus_allowed_list: 2
Cpus_allowed_list: 10
Cpus_allowed_list: 11
Cpus_allowed_list: 4
Cpus_allowed_list: 5
Cpus_allowed_list: 6
Cpus_allowed_list: 14
Cpus_allowed_list: 7
Cpus_allowed_list: 3
Cpus_allowed_list: 0
Cpus_allowed_list: 8
Cpus_allowed_list: 1
Cpus_allowed_list: 9
Cpus_allowed_list: 2
Cpus_allowed_list: 10
Cpus_allowed_list: 3
Cpus_allowed_list: 11
Cpus_allowed_list: 4
Cpus_allowed_list: 12
Cpus_allowed_list: 5
Cpus_allowed_list: 6
Cpus_allowed_list: 14
Cpus_allowed_list: 7
Cpus_allowed_list: 15
Cpus_allowed_list: 13
______________________________________________________________________
[sysadmin@ip-10-0-8-88 ~]$ slurmd -C
NodeName=ip-10-0-8-88 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=30986
UpTime=0-00:06:10
______________________________________________________________________
Cloud server stats:
[sysadmin@ip-10-0-8-88 ~]$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes
1 0 0 1 1:1:1:0 yes
2 0 0 2 2:2:2:0 yes
3 0 0 3 3:3:3:0 yes
4 0 0 4 4:4:4:0 yes
5 0 0 5 5:5:5:0 yes
6 0 0 6 6:6:6:0 yes
7 0 0 7 7:7:7:0 yes
8 0 0 0 0:0:0:0 yes
9 0 0 1 1:1:1:0 yes
10 0 0 2 2:2:2:0 yes
11 0 0 3 3:3:3:0 yes
12 0 0 4 4:4:4:0 yes
13 0 0 5 5:5:5:0 yes
14 0 0 6 6:6:6:0 yes
15 0 0 7 7:7:7:0 yes
______________________________________________________________________
# slurm.conf file generated by configurator easy.html.
SlurmctldHost=bt_slurm_master
MailProg=/dev/null
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
SlurmdPidFile=/var/run/slurmd/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmctld/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
TaskPluginParam=Threads
SlurmdTimeout=500
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/none
ClusterName=simplecluster
JobAcctGatherType=jobacct_gather/none
PropagatePrioProcess=2
MaxTasksPerNode=16
ResumeProgram=/bt/admin/slurm/etc/slurm_ec2_startup.sh
ResumeTimeout=900
ResumeRate=0
SuspendProgram=/bt/admin/slurm/etc/slurm_ec2_shutdown.sh
SuspendTime=600
SuspendTimeout=120
TreeWidth=1024
SuspendRate=0
NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=CLOUD
NodeName=bt_slurm_login00[1-10] RealMemory=512 State=DOWN
PartitionName=backtest Nodes=ip-10-0-8-[2-100] Default=YES MaxTime=300 Oversubscribe=NO State=UP
______________________________________________________________________
[sysadmin@bt_slurm_master ~]$ cat /etc/slurm/cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
CgroupMountpoint="/sys/fs/cgroup"
TaskAffinity=yes
ConstrainCores=yes
ConstrainRAMSpace=no
______________________________________________________________________
JobId=107424 ArrayJobId=107423 ArrayTaskId=1 JobName=aijing_test
UserId=btuser(1001) GroupId=users(100) MCS_label=N/A
Priority=4294901612 Nice=0 Account=(null) QOS=(null)
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:01:06 TimeLimit=05:00:00 TimeMin=N/A
SubmitTime=2019-05-17T01:55:56 EligibleTime=2019-05-17T01:55:56
AccrueTime=Unknown
StartTime=2019-05-17T01:55:56 EndTime=2019-05-17T01:57:02 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-05-17T01:55:56
Partition=backtest AllocNode:Sid=bt_slurm_login001:14002
ReqNodeList=(null) ExcNodeList=(null)
NodeList=ip-10-0-8-88
BatchHost=ip-10-0-8-88
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=ip-10-0-8-88 CPU_IDs=0-1 Mem=0 GRES_IDX=
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/bt/data/backtester/destination_tables_94633/batch_run.sh
WorkDir=/bt/data/backtester/destination_tables_94633
StdErr=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
StdIn=/dev/null
StdOut=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
Power=