Just looking at this quickly: have you tried specifying "--hint=multithread" 
as an sbatch parameter?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
nathan norton
Sent: Saturday, May 18, 2019 6:03 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] final stages of cloud infrastructure set up

Hi,
I am in the process of setting up Slurm on Amazon cloud infrastructure. All 
is going well: I can elastically start and stop nodes when jobs run. I am 
running into a few small teething issues that are probably due to my not 
understanding some of the terminology. At a high level, all the nodes given 
to end users in the cloud are hyper-threaded, so I want to use my nodes as 
hyper-threaded nodes. All nodes are running the latest CentOS 7. I would also 
like each job to run in a cgroup and not migrate around after it starts. As I 
said, I think most of it is working except for the few issues below.

My use case: I have an in-house binary application that is single-threaded 
and does no message passing or anything like that. The application is not 
memory bound; it is only compute bound.

So on a node I would like to be able to run 16 instances in parallel. As can be 
seen below, if I launch the app directly via srun, it runs on every hardware 
thread (CPUs 0-15). However, if I run it via sbatch, it only runs on CPUs 0-7 
instead of CPUs 0-15.

Another question: what is the best way to retry failed jobs? I can rerun the 
whole batch again, but I only want to rerun a single step in the batch.
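
One way to retry a single step without resubmitting the whole batch is to wrap that step in a small retry loop inside the batch script. The following is only a minimal sketch under that assumption: `retry_step` is a hypothetical helper name, not a Slurm feature, and `./my_app` in the usage comment is a placeholder for the real binary. Resubmitting one array element with `sbatch --array=<index>` is another option when the whole element has to be rerun.

```shell
#!/bin/bash
# retry_step is a hypothetical helper (not part of Slurm).
# Usage: retry_step MAX_ATTEMPTS command [args...]
# Runs the command until it exits 0 or MAX_ATTEMPTS is reached,
# and returns the exit code of the last attempt.
retry_step() {
    local max_attempts=$1
    shift
    local attempt=1
    local rc=0
    while [ "$attempt" -le "$max_attempts" ]; do
        # Stop as soon as the step succeeds.
        "$@" && return 0
        rc=$?
        echo "attempt ${attempt}/${max_attempts} failed with exit code ${rc}" >&2
        attempt=$((attempt + 1))
    done
    return "$rc"
}

# Example use inside a batch script (placeholder command):
#   retry_step 3 srun -n1 ./my_app
```

Only the wrapped step is repeated; the other steps in the script run once as before.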

Please see below for the output of various commands, as well as my slurm.conf 
file.

Many thanks
Nathan.

______________________________________________________________________
btuser@bt_slurm_login001[domain ]% slurmd -V
slurm 18.08.6-2
______________________________________________________________________
btuser@bt_slurm_login001[domain ]% cat nathan.batch.sh
#!/bin/bash
#SBATCH --job-name=nathan_test
#SBATCH --ntasks=1
#SBATCH --array=1-32
#SBATCH --ntasks-per-core=2
hostname
srun --hint=multithread -n1 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list


btuser@bt_slurm_login001[domain ]%
btuser@bt_slurm_login001[domain ]% sbatch nathan.batch.sh
Submitted batch job 106491
btuser@bt_slurm_login001[domain ]% cat slurm-106491_*
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:      1
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:      2
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:      3
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:      4
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:      5
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:      6
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:      7
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:      0
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:      1
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:      2
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:      0
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:      3
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:      4
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:      5
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:      6
ip-10-0-8-90.ec2.internal
Cpus_allowed_list:      7
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:      0
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:      1
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:      2
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:      3
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:      4
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:      1
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:      5
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:      6
ip-10-0-8-91.ec2.internal
Cpus_allowed_list:      7
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:      2
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:      3
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:      4
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:      5
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:      6
ip-10-0-8-88.ec2.internal
Cpus_allowed_list:      7
ip-10-0-8-89.ec2.internal
Cpus_allowed_list:      0
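
To make the listing above easier to read, the hostname / Cpus_allowed_list pairs can be condensed into a count of distinct CPUs per host. This is a rough sketch: `summarise_cpus` is a hypothetical helper name, and it assumes the output files contain exactly the alternating two-line pattern shown above.

```shell
#!/bin/bash
# summarise_cpus is a hypothetical helper. It reads alternating
# "hostname" / "Cpus_allowed_list: N" lines on stdin and prints,
# for each host, how many distinct CPU lists were seen on it.
summarise_cpus() {
    awk '
        /^Cpus_allowed_list:/ {
            # Count each (host, cpu list) pair only once.
            if (host != "" && !seen[host, $2]++) count[host]++
            next
        }
        NF > 0 { host = $1 }   # any other non-empty line is a hostname
        END { for (h in count) print h, count[h] }
    ' | sort
}

# Example use against the array job output files:
#   cat slurm-106491_* | summarise_cpus
```

For the output above, this reports 8 distinct CPUs (0-7) on each of the four hosts, which matches the observation that the sbatch run never lands on CPUs 8-15.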
______________________________________________________________________
btuser@bt_slurm_login001[domain ]% srun -n32 --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list
Cpus_allowed_list:      12
Cpus_allowed_list:      13
Cpus_allowed_list:      15
Cpus_allowed_list:      0
Cpus_allowed_list:      8
Cpus_allowed_list:      1
Cpus_allowed_list:      9
Cpus_allowed_list:      2
Cpus_allowed_list:      10
Cpus_allowed_list:      11
Cpus_allowed_list:      4
Cpus_allowed_list:      5
Cpus_allowed_list:      6
Cpus_allowed_list:      14
Cpus_allowed_list:      7
Cpus_allowed_list:      3
Cpus_allowed_list:      0
Cpus_allowed_list:      8
Cpus_allowed_list:      1
Cpus_allowed_list:      9
Cpus_allowed_list:      2
Cpus_allowed_list:      10
Cpus_allowed_list:      3
Cpus_allowed_list:      11
Cpus_allowed_list:      4
Cpus_allowed_list:      12
Cpus_allowed_list:      5
Cpus_allowed_list:      6
Cpus_allowed_list:      14
Cpus_allowed_list:      7
Cpus_allowed_list:      15
Cpus_allowed_list:      13
______________________________________________________________________
[sysadmin@ip-10-0-8-88 ~]$ slurmd -C
NodeName=ip-10-0-8-88 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=30986
UpTime=0-00:06:10
______________________________________________________________________
Cloud server stats:
[sysadmin@ip-10-0-8-88 ~]$ lscpu  -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0   0    0      0    0:0:0:0       yes
1   0    0      1    1:1:1:0       yes
2   0    0      2    2:2:2:0       yes
3   0    0      3    3:3:3:0       yes
4   0    0      4    4:4:4:0       yes
5   0    0      5    5:5:5:0       yes
6   0    0      6    6:6:6:0       yes
7   0    0      7    7:7:7:0       yes
8   0    0      0    0:0:0:0       yes
9   0    0      1    1:1:1:0       yes
10  0    0      2    2:2:2:0       yes
11  0    0      3    3:3:3:0       yes
12  0    0      4    4:4:4:0       yes
13  0    0      5    5:5:5:0       yes
14  0    0      6    6:6:6:0       yes
15  0    0      7    7:7:7:0       yes
______________________________________________________________________
# slurm.conf file generated by configurator easy.html.
SlurmctldHost=bt_slurm_master
MailProg=/dev/null
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
SlurmdPidFile=/var/run/slurmd/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmctld/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
TaskPluginParam=Threads
SlurmdTimeout=500
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/none
ClusterName=simplecluster
JobAcctGatherType=jobacct_gather/none
PropagatePrioProcess=2
MaxTasksPerNode=16
ResumeProgram=/bt/admin/slurm/etc/slurm_ec2_startup.sh
ResumeTimeout=900
ResumeRate=0
SuspendProgram=/bt/admin/slurm/etc/slurm_ec2_shutdown.sh
SuspendTime=600
SuspendTimeout=120
TreeWidth=1024
SuspendRate=0
NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=CLOUD
NodeName=bt_slurm_login00[1-10] RealMemory=512 State=DOWN
PartitionName=backtest Nodes=ip-10-0-8-[2-100] Default=YES MaxTime=300 Oversubscribe=NO State=UP

______________________________________________________________________

[sysadmin@bt_slurm_master ~]$ cat /etc/slurm/cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
CgroupMountpoint="/sys/fs/cgroup"
TaskAffinity=yes
ConstrainCores=yes
ConstrainRAMSpace=no

______________________________________________________________________


JobId=107424 ArrayJobId=107423 ArrayTaskId=1 JobName=aijing_test
   UserId=btuser(1001) GroupId=users(100) MCS_label=N/A
   Priority=4294901612 Nice=0 Account=(null) QOS=(null)
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:01:06 TimeLimit=05:00:00 TimeMin=N/A
   SubmitTime=2019-05-17T01:55:56 EligibleTime=2019-05-17T01:55:56
   AccrueTime=Unknown
   StartTime=2019-05-17T01:55:56 EndTime=2019-05-17T01:57:02 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-05-17T01:55:56
   Partition=backtest AllocNode:Sid=bt_slurm_login001:14002
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=ip-10-0-8-88
   BatchHost=ip-10-0-8-88
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=ip-10-0-8-88 CPU_IDs=0-1 Mem=0 GRES_IDX=
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bt/data/backtester/destination_tables_94633/batch_run.sh
   WorkDir=/bt/data/backtester/destination_tables_94633
   StdErr=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
   StdIn=/dev/null
   StdOut=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out
   Power=
