You can use Slurm with hyperthreaded cores. It just takes some care when
configuring and requesting the resources.
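For example (a minimal sketch with illustrative node names and values, not
your actual CycleCloud-generated config), the CPU count Slurm advertises
comes straight from how the node is defined in slurm.conf and which
consumable resource you pick:

# slurm.conf - count each hardware thread as a schedulable CPU
NodeName=node[1-5] CPUs=2 Sockets=1 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=3072
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
# ...or allocate whole physical cores instead, ignoring the sibling threads:
# SelectTypeParameters=CR_Core

With CPUs=2 the node contributes both threads to the partition's CPU total;
with CPUs=1 (which appears to be what was generated for your nodes) only the
physical core is counted.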
The can of worms you are opening is the stance (in HPC) that
hyperthreading is detrimental. If you are using HPC as intended, I
completely agree with that stance. The objective is to be as efficient
as possible with the resources. If you have 4 cores running at 100%, you
will lose efficiency by splitting them into 8 hyperthreads and doing the
same work. So, rather than trying to increase the core count with HT,
strive for 100% utilization of the physical cores.
That being said, if you are running interactive jobs, those may well
benefit from having hyperthreaded cores. I have users who insist on using
an HPC node to run a Linux desktop (not what HPC is meant for, to be
sure). They definitely come out ahead with hyperthreading enabled.
So, it depends on a number of variables, and for many of them there is
disagreement about whether they should even be in the equation.
To really get a better understanding, I would steer you away from
CycleCloud and encourage you to do your own install so you can learn the
knobs and gauges that are hidden from view by middleware. This list can
be a great source of help, as well as the many articles, wikis and videos
out there.
TL;DR: If you are going to be running efficient HPC jobs, you are indeed
better off with HT turned off.
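One more note, since changing the SKU may not always be an option: if HT
stays on and the nodes are configured with ThreadsPerCore=2, you can still
ask Slurm at submit time to place tasks on physical cores only (a sketch,
untested on a CycleCloud-generated config):

#SBATCH --hint=nomultithread   # use only one thread per physical core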
Brian Andrus
On 12/13/2022 8:03 AM, Gary Mansell wrote:
Hi, thanks for getting back to me.
I have been doing some more experimenting, and I think that the issue
is that the Azure VMs for my nodes are hyperthreaded.
Slurm sees the cluster as 5 nodes with 1 CPU each and seems to ignore the
hyperthreading - so Slurm sees the cluster as a 5-CPU cluster
(and not 10 as I thought) - so it is correct that it can't run a 10-CPU
job.
Speaking with my CFD types - they say our code should not be run on HT
nodes, so I have switched to a different Azure VM SKU for the nodes, one
without HT, and the CPU count in Slurm now matches the count in the
VMs.
So - does Slurm actually ignore HT cores, as I am supposing?
Regards
Gary
On Tue, 13 Dec 2022 at 15:52, Brian Andrus <toomuc...@gmail.com> wrote:
Gary,
Well, your first issue is using CycleCloud, but that is mostly
opinion :)
Your error states there aren't enough CPUs in the partition, which
means we should take a look at the partition settings.
Take a look at 'scontrol show partition hpc' and see how many
nodes are assigned to it. Also check the state of the nodes with
'sinfo'.
It would also be good to ensure the node settings are right. Run
'slurmd -C' on a node and see if the output matches what is in the
config.
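As an illustration (the values here are made up, not taken from your
system), the output should look something like:

slurmd -C
NodeName=ricslurm-hpc-pg0-1 CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=3891
UpTime=0-00:10:12

If CPUs or ThreadsPerCore there disagree with the NodeName line in
slurm.conf, that mismatch is usually where the missing CPUs went.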
Brian Andrus
On 12/13/2022 1:38 AM, Gary Mansell wrote:
Dear Slurm Users, perhaps you can help me with a problem that I
am having using the scheduler (I am new to this, so please
forgive me for any stupid mistakes/misunderstandings).
I am not able to submit a multi-threaded MPI job on a small demo
cluster that I have set up using Azure CycleCloud that uses all
10 CPUs in my cluster, and I don’t understand why – perhaps
you can explain why, and how I can fix this to use all available CPUs?
The hpc partition that I have set up consists of 5 nodes (Azure VM
type = Standard_F2s_v2), each with 2 CPUs (I presume that these
are hyperthreaded cores, rather than 2 physical CPUs – but I am not
certain of this)?
[azccadmin@ricslurm-hpc-pg0-1 ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
stepping : 6
microcode : 0xffffffff
cpu MHz : 2793.436
cache size : 49152 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep
mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht
syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology
eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2
movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase
bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed
adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec
md_clear
bogomips : 5586.87
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
stepping : 6
microcode : 0xffffffff
cpu MHz : 2793.436
cache size : 49152 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep
mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht
syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology
eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2
movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase
bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed
adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec
md_clear
bogomips : 5586.87
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
This is how Slurm sees one of the nodes:
[azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show nodes
NodeName=ricslurm-hpc-pg0-1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=1 CPUTot=1 CPULoad=0.88
AvailableFeatures=cloud
ActiveFeatures=cloud
Gres=(null)
NodeAddr=ricslurm-hpc-pg0-1 NodeHostName=ricslurm-hpc-pg0-1
Version=22.05.3
OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25
17:23:54 UTC 2020
RealMemory=3072 AllocMem=0 FreeMem=1854 Sockets=1 Boards=1
State=IDLE+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=hpc
BootTime=2022-12-12T17:42:27 SlurmdStartTime=2022-12-12T17:42:28
LastBusyTime=2022-12-12T17:52:29
CfgTRES=cpu=1,mem=3G,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
This is the Slurm job control script I have come up with to run
the VECTIS job (I have set 5 nodes, 1 task per node, and 2 CPUs per
task – is this right?):
#!/bin/bash
## Job name
#SBATCH --job-name=run-grma
#
## File to write standard output and error
#SBATCH --output=run-grma.out
#SBATCH --error=run-grma.err
#
## Partition for the cluster (you might not need that)
#SBATCH --partition=hpc
#
## Number of nodes
#SBATCH --nodes=5
#
## Number of tasks per node
#SBATCH --ntasks-per-node=1
#
## Number of CPUs per task
#SBATCH --cpus-per-task=2
#
## General
module purge
## Initialise VECTIS 2022.3b4
if [ -d /shared/apps/RealisSimulation/2022.3/bin ]
then
export PATH=$PATH:/shared/apps/RealisSimulation/2022.3/bin
else
echo "Failed to Initialise VECTIS"
fi
## Run
vpre -V 2022.3 -np $SLURM_NTASKS /shared/data/LID_CAVITY/files/lid.GRD
vsolve -V 2022.3 -np $SLURM_NTASKS -mpi intel_2018.4 -rdmu /shared/data/LID_CAVITY/files/lid_no_write.inp
But the submitted job will not run, as it says that there are not
enough CPUs.
Here is the debug log from slurmctld – where you can see that it
says the job has requested 10 CPUs (which is what I want),
but the hpc partition only has 5 (which I think is wrong?):
[2022-12-13T09:05:01.177] debug2: Processing RPC:
REQUEST_NODE_INFO from UID=0
[2022-12-13T09:05:01.370] debug2: Processing RPC:
REQUEST_SUBMIT_BATCH_JOB from UID=20001
[2022-12-13T09:05:01.371] debug3: _set_hostname: Using auth
hostname for alloc_node: ricslurm-scheduler
[2022-12-13T09:05:01.371] debug3: JobDesc: user_id=20001
JobId=N/A partition=hpc name=run-grma
[2022-12-13T09:05:01.371] debug3: cpus=10-4294967294
pn_min_cpus=2 core_spec=-1
[2022-12-13T09:05:01.371] debug3: Nodes=5-[5] Sock/Node=65534
Core/Sock=65534 Thread/Core=65534
[2022-12-13T09:05:01.371] debug3:
pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2022-12-13T09:05:01.371] debug3: immediate=0 reservation=(null)
[2022-12-13T09:05:01.371] debug3: features=(null)
batch_features=(null) cluster_features=(null) prefer=(null)
[2022-12-13T09:05:01.371] debug3: req_nodes=(null) exc_nodes=(null)
[2022-12-13T09:05:01.371] debug3: time_limit=15-15 priority=-1
contiguous=0 shared=-1
[2022-12-13T09:05:01.371] debug3: kill_on_node_fail=-1
script=#!/bin/bash
## Job name
#SBATCH --job-n...
[2022-12-13T09:05:01.371] debug3:
argv="/shared/data/LID_CAVITY/slurm-runit.sh"
[2022-12-13T09:05:01.371] debug3:
environment=XDG_SESSION_ID=12,HOSTNAME=ricslurm-scheduler,SELINUX_ROLE_REQUESTED=,...
[2022-12-13T09:05:01.371] debug3: stdin=/dev/null
stdout=/shared/data/LID_CAVITY/run-grma.out
stderr=/shared/data/LID_CAVITY/run-grma.err
[2022-12-13T09:05:01.372] debug3:
work_dir=/shared/data/LID_CAVITY
alloc_node:sid=ricslurm-scheduler:13464
[2022-12-13T09:05:01.372] debug3: power_flags=
[2022-12-13T09:05:01.372] debug3: resp_host=(null)
alloc_resp_port=0 other_port=0
[2022-12-13T09:05:01.372] debug3: dependency=(null)
account=(null) qos=(null) comment=(null)
[2022-12-13T09:05:01.372] debug3: mail_type=0 mail_user=(null)
nice=0 num_tasks=5 open_mode=0 overcommit=-1 acctg_freq=(null)
[2022-12-13T09:05:01.372] debug3: network=(null) begin=Unknown
cpus_per_task=2 requeue=-1 licenses=(null)
[2022-12-13T09:05:01.372] debug3: end_time= signal=0@0
wait_all_nodes=-1 cpu_freq=
[2022-12-13T09:05:01.372] debug3: ntasks_per_node=1
ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
[2022-12-13T09:05:01.372] debug3: mem_bind=0:(null) plane_size:65534
[2022-12-13T09:05:01.372] debug3: array_inx=(null)
[2022-12-13T09:05:01.372] debug3: burst_buffer=(null)
[2022-12-13T09:05:01.372] debug3: mcs_label=(null)
[2022-12-13T09:05:01.372] debug3: deadline=Unknown
[2022-12-13T09:05:01.372] debug3: bitflags=0x1a00c000
delay_boot=4294967294
[2022-12-13T09:05:01.372] debug3: job_submit/lua:
slurm_lua_loadscript: skipping loading Lua script:
/etc/slurm/job_submit.lua
[2022-12-13T09:05:01.372] lua: Setting reqswitch to 1.
[2022-12-13T09:05:01.372] lua: returning.
[2022-12-13T09:05:01.372] debug2: _part_access_check: Job
requested too many CPUs (10) of partition hpc(5)
[2022-12-13T09:05:01.373] debug2: _part_access_check: Job
requested too many CPUs (10) of partition hpc(5)
[2022-12-13T09:05:01.373] debug2: JobId=1 can't run in partition
hpc: More processors requested than permitted
The job will run fine if I use the below settings (across 5
nodes, but only using one of the two CPUs on each node):
## Number of nodes
#SBATCH --nodes=5
#
## Number of tasks per node
#SBATCH --ntasks-per-node=1
#
## Number of CPUs per task
#SBATCH --cpus-per-task=1
Here are the successfully submitted job details showing it using 5
CPUs (only one CPU per node) across 5 nodes:
[azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show job 3
JobId=3 JobName=run-grma
UserId=azccadmin(20001) GroupId=azccadmin(20001) MCS_label=N/A
Priority=4294901757 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:07:35 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2022-12-12T17:32:01 EligibleTime=2022-12-12T17:32:01
AccrueTime=2022-12-12T17:32:01
StartTime=2022-12-12T17:42:46 EndTime=2022-12-12T17:57:46
Deadline=N/A
SuspendTime=None SecsPreSuspend=0
LastSchedEval=2022-12-12T17:32:01 Scheduler=Main
Partition=hpc AllocNode:Sid=ricslurm-scheduler:11723
ReqNodeList=(null) ExcNodeList=(null)
NodeList=ricslurm-hpc-pg0-[1-5]
BatchHost=ricslurm-hpc-pg0-1
NumNodes=5 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=5,mem=15G,node=5,billing=5
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/shared/data/LID_CAVITY/slurm-runit.sh
WorkDir=/shared/data/LID_CAVITY
StdErr=/shared/data/LID_CAVITY/run-grma.err
StdIn=/dev/null
StdOut=/shared/data/LID_CAVITY/run-grma.out
Switches=1@00:00:24
Power=
What am I doing wrong here - how do I get it to run the job on
both CPUs on all 5 nodes (i.e. fully utilising the available
cluster resources of 10 CPUs)?
Regards
Gary