Unfortunately that didn't work.

However, I modified my slurm.conf to lie and say I had 16 CPUs with 1 thread each, and now everything is working fine.
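For the record, an untested sketch of the workaround described above: the node definition advertises the 16 hardware threads as 16 single-threaded cores so the scheduler will hand all of them out.

```
# slurm.conf - untested sketch of the "lie" described above:
# advertise 16 single-threaded cores instead of 8 cores x 2 threads
NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=16 ThreadsPerCore=1 State=CLOUD
```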

One issue with CLOUD state machines is that when I run 'scontrol show nodes' they don't show up. Is there a way I can get their info when they are not running?
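From what I can tell, powered-down CLOUD nodes are hidden from sinfo/scontrol output by default; if I am reading the slurm.conf man page correctly, adding "cloud" to PrivateData should make them visible even while powered down. An untested sketch of the relevant line:

```
# slurm.conf - untested sketch: "cloud" here makes powered-down
# cloud nodes visible in sinfo/scontrol output
PrivateData=cloud
```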

Thanks
Nathan

On 20/5/19 12:04 am, Riebs, Andy wrote:

Just looking at this quickly, have you tried specifying “hint=multithread” as an sbatch parameter?

*From:*slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On Behalf Of *nathan norton
*Sent:* Saturday, May 18, 2019 6:03 PM
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] final stages of cloud infrastructure set up

Hi,

I am in the process of setting up Slurm on Amazon cloud infrastructure. All is going well: I can elastically start and stop nodes when jobs run. I am running into a few small teething issues, probably due to me not understanding some of the terminology. At a high level, all the nodes given to end users in the cloud are hyper-threaded, so I want to use my nodes as hyper-threaded nodes. All nodes are running the latest CentOS 7. I would also like jobs to run in a cgroup and not migrate around after they start. As I said, I think most of it is working except for the few issues below.

My use case: I have an in-house-built binary application that is single-threaded and does no message passing or anything like that. The application is not memory bound, only compute bound.

So on a node I would like to be able to run 16 instances in parallel. As can be seen below, if I launch the single app via srun it runs on each hardware thread. However, if I run it via the sbatch command, it only runs on CPUs 0-7 instead of CPUs 0-15.

Another question: what is the best way to retry failed jobs? I can rerun the whole batch again, but I only want to rerun a single step of the batch.
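For what it's worth, two things I have been looking at for this (untested; the task id 17 below is just an example): resubmitting a single task of the array by overriding --array on the command line, or requeueing a specific array task with scontrol.

```
# Untested sketch. Command-line sbatch options override the #SBATCH
# directives in the script, so this should resubmit only array task 17:
sbatch --array=17 nathan.batch.sh

# Or requeue a single task of an existing array job by <jobid>_<taskid>
# (the job must be requeueable, i.e. Requeue=1):
scontrol requeue 106491_17
```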

Please see below for the output of various commands, as well as my slurm.conf file.

Many thanks

Nathan.

______________________________________________________________________

btuser@bt_slurm_login001[domain ]% slurmd -V

slurm 18.08.6-2

______________________________________________________________________

btuser@bt_slurm_login001[domain ]% cat nathan.batch.sh

#!/bin/bash

#SBATCH --job-name=nathan_test

#SBATCH --ntasks=1

#SBATCH --array=1-32

#SBATCH --ntasks-per-core=2

hostname

srun --hint=multithread -n1  --exclusive --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list

btuser@bt_slurm_login001[domain ]%

btuser@bt_slurm_login001[domain ]% sbatch Nathan.batch.sh

Submitted batch job 106491

btuser@bt_slurm_login001[domain ]% cat slurm-106491_*

btuser@bt_slurm_login001[domain ]% cat slurm-106491_*

ip-10-0-8-89.ec2.internal

Cpus_allowed_list:      1

ip-10-0-8-89.ec2.internal

Cpus_allowed_list:      2

ip-10-0-8-89.ec2.internal

Cpus_allowed_list:      3

ip-10-0-8-89.ec2.internal

Cpus_allowed_list:      4

ip-10-0-8-89.ec2.internal

Cpus_allowed_list:      5

ip-10-0-8-89.ec2.internal

Cpus_allowed_list:      6

ip-10-0-8-89.ec2.internal

Cpus_allowed_list:      7

ip-10-0-8-90.ec2.internal

Cpus_allowed_list:      0

ip-10-0-8-90.ec2.internal

Cpus_allowed_list:      1

ip-10-0-8-90.ec2.internal

Cpus_allowed_list:      2

ip-10-0-8-88.ec2.internal

Cpus_allowed_list:      0

ip-10-0-8-90.ec2.internal

Cpus_allowed_list:      3

ip-10-0-8-90.ec2.internal

Cpus_allowed_list:      4

ip-10-0-8-90.ec2.internal

Cpus_allowed_list:      5

ip-10-0-8-90.ec2.internal

Cpus_allowed_list:      6

ip-10-0-8-90.ec2.internal

Cpus_allowed_list:      7

ip-10-0-8-91.ec2.internal

Cpus_allowed_list:      0

ip-10-0-8-91.ec2.internal

Cpus_allowed_list:      1

ip-10-0-8-91.ec2.internal

Cpus_allowed_list:      2

ip-10-0-8-91.ec2.internal

Cpus_allowed_list:      3

ip-10-0-8-91.ec2.internal

Cpus_allowed_list:      4

ip-10-0-8-88.ec2.internal

Cpus_allowed_list:      1

ip-10-0-8-91.ec2.internal

Cpus_allowed_list:      5

ip-10-0-8-91.ec2.internal

Cpus_allowed_list:      6

ip-10-0-8-91.ec2.internal

Cpus_allowed_list:      7

ip-10-0-8-88.ec2.internal

Cpus_allowed_list:      2

ip-10-0-8-88.ec2.internal

Cpus_allowed_list:      3

ip-10-0-8-88.ec2.internal

Cpus_allowed_list:      4

ip-10-0-8-88.ec2.internal

Cpus_allowed_list:      5

ip-10-0-8-88.ec2.internal

Cpus_allowed_list:      6

ip-10-0-8-88.ec2.internal

Cpus_allowed_list:      7

ip-10-0-8-89.ec2.internal

Cpus_allowed_list:      0

______________________________________________________________________

btuser@bt_slurm_login001[domain ]% srun -n32 --exclusive   --cpu_bind=threads cat /proc/self/status | grep -i cpus_allowed_list

Cpus_allowed_list:      12

Cpus_allowed_list:      13

Cpus_allowed_list:      15

Cpus_allowed_list:      0

Cpus_allowed_list:      8

Cpus_allowed_list:      1

Cpus_allowed_list:      9

Cpus_allowed_list:      2

Cpus_allowed_list:      10

Cpus_allowed_list:      11

Cpus_allowed_list:      4

Cpus_allowed_list:      5

Cpus_allowed_list:      6

Cpus_allowed_list:      14

Cpus_allowed_list:      7

Cpus_allowed_list:      3

Cpus_allowed_list:      0

Cpus_allowed_list:      8

Cpus_allowed_list:      1

Cpus_allowed_list:      9

Cpus_allowed_list:      2

Cpus_allowed_list:      10

Cpus_allowed_list:      3

Cpus_allowed_list:      11

Cpus_allowed_list:      4

Cpus_allowed_list:      12

Cpus_allowed_list:      5

Cpus_allowed_list:      6

Cpus_allowed_list:      14

Cpus_allowed_list:      7

Cpus_allowed_list:      15

Cpus_allowed_list:      13

______________________________________________________________________

[sysadmin@ip-10-0-8-88 ~]$ slurmd -C

NodeName=ip-10-0-8-88 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=30986

UpTime=0-00:06:10

______________________________________________________________________

Cloud server stats:

[sysadmin@ip-10-0-8-88 ~]$ lscpu  -e

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE

0   0    0      0    0:0:0:0       yes

1   0    0      1    1:1:1:0       yes

2   0    0      2    2:2:2:0       yes

3   0    0      3    3:3:3:0       yes

4   0    0      4    4:4:4:0       yes

5   0    0      5    5:5:5:0       yes

6   0    0      6    6:6:6:0       yes

7   0    0      7    7:7:7:0       yes

8   0    0      0    0:0:0:0       yes

9   0    0      1    1:1:1:0       yes

10  0    0      2    2:2:2:0       yes

11  0    0      3    3:3:3:0       yes

12  0    0      4    4:4:4:0       yes

13  0    0      5    5:5:5:0       yes

14  0    0      6    6:6:6:0       yes

15  0    0      7    7:7:7:0       yes

______________________________________________________________________

# slurm.conf file generated by configurator easy.html.

SlurmctldHost=bt_slurm_master

MailProg=/dev/null

MpiDefault=none

ProctrackType=proctrack/cgroup

ReturnToService=1

SlurmctldPidFile=/var/run/slurmd/slurmctld.pid

SlurmdPidFile=/var/run/slurmd/slurmd.pid

SlurmdSpoolDir=/var/spool/slurmctld/slurmd

SlurmUser=slurm

StateSaveLocation=/var/spool/slurmctld

SwitchType=switch/none

TaskPlugin=task/cgroup

TaskPluginParam=Threads

SlurmdTimeout=500

FastSchedule=1

SchedulerType=sched/backfill

SelectType=select/cons_res

SelectTypeParameters=CR_CPU

AccountingStorageType=accounting_storage/none

ClusterName=simplecluster

JobAcctGatherType=jobacct_gather/none

PropagatePrioProcess=2

MaxTasksPerNode=16

ResumeProgram=/bt/admin/slurm/etc/slurm_ec2_startup.sh

ResumeTimeout=900

ResumeRate=0

SuspendProgram=/bt/admin/slurm/etc/slurm_ec2_shutdown.sh

SuspendTime=600

SuspendTimeout=120

TreeWidth=1024

SuspendRate=0

NodeName=ip-10-0-8-[2-100] CPUs=16 RealMemory=27648 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2  State=CLOUD

NodeName=bt_slurm_login00[1-10] RealMemory=512 State=DOWN

PartitionName=backtest Nodes=ip-10-0-8-[2-100] Default=YES MaxTime=300 Oversubscribe=NO State=UP

______________________________________________________________________

[sysadmin@bt_slurm_master ~]$ cat /etc/slurm/cgroup.conf

###

#

# Slurm cgroup support configuration file

#

# See man slurm.conf and man cgroup.conf for further

# information on cgroup configuration parameters

#--

CgroupAutomount=yes

CgroupMountpoint="/sys/fs/cgroup"

TaskAffinity=yes

ConstrainCores=yes

ConstrainRAMSpace=no

______________________________________________________________________

JobId=107424 ArrayJobId=107423 ArrayTaskId=1 JobName=aijing_test

   UserId=btuser(1001) GroupId=users(100) MCS_label=N/A

   Priority=4294901612 Nice=0 Account=(null) QOS=(null)

   JobState=COMPLETED Reason=None Dependency=(null)

   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0

   DerivedExitCode=0:0

   RunTime=00:01:06 TimeLimit=05:00:00 TimeMin=N/A

   SubmitTime=2019-05-17T01:55:56 EligibleTime=2019-05-17T01:55:56

   AccrueTime=Unknown

   StartTime=2019-05-17T01:55:56 EndTime=2019-05-17T01:57:02 Deadline=N/A

   PreemptTime=None SuspendTime=None SecsPreSuspend=0

   LastSchedEval=2019-05-17T01:55:56

   Partition=backtest AllocNode:Sid=bt_slurm_login001:14002

  ReqNodeList=(null) ExcNodeList=(null)

   NodeList=ip-10-0-8-88

   BatchHost=ip-10-0-8-88

   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

   TRES=cpu=1,node=1,billing=1

   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

     Nodes=ip-10-0-8-88 CPU_IDs=0-1 Mem=0 GRES_IDX=

   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0

   Features=(null) DelayBoot=00:00:00

   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

   Command=/bt/data/backtester/destination_tables_94633/batch_run.sh

   WorkDir=/bt/data/backtester/destination_tables_94633

   StdErr=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out

   StdIn=/dev/null

   StdOut=/bt/data/backtester/destination_tables_94633/slurm-107423_1.out

   Power=

