3 possible issues, inline below
On 14/11/2019 14:58:29, Sukman wrote:
> I believe those should be uppercase #SBATCH
>
> Hi Brian, thank you for the suggestion.
>
> It appears that my node is in drain state. I rebooted the node and
> everything became fine. However, the QOS still cannot be applied
> properly. Do you have any opinion regarding this issue?
>
> $ sacctmgr show qos where Name=normal_compute format=Name,Priority,MaxWal,MaxTRESPU
>       Name   Priority     MaxWall     MaxTRESPU
> ---------- ---------- ----------- -------------
> normal_co+         10    00:01:00  cpu=2,mem=1G
>
> when I run the following script:
>
> #!/bin/bash
> #SBATCH --job-name=hostname
> #sbatch --time=00:50
> #sbatch --mem=1M

MinMemoryNode seems to require more than FreeMem in Node below

> #SBATCH --nodes=1
> #SBATCH --ntasks=1
> #SBATCH --ntasks-per-node=1
> #SBATCH --cpus-per-task=1
> #SBATCH --nodelist=cn110
>
> srun hostname
>
> It turns out that the QOSMaxMemoryPerUser has been met:
>
> $ squeue
>    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>       88      defq hostname   sukman PD       0:00      1 (QOSMaxMemoryPerUser)
>
> $ scontrol show job 88
> JobId=88 JobName=hostname
>    UserId=sukman(1000) GroupId=nobody(1000) MCS_label=N/A
>    Priority=4294901753 Nice=0 Account=user QOS=normal_compute
>    JobState=PENDING Reason=QOSMaxMemoryPerUser Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
>    SubmitTime=2019-11-14T19:49:37 EligibleTime=2019-11-14T19:49:37
>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-11-14T19:55:50
>    Partition=defq AllocNode:Sid=itbhn02:51072
>    ReqNodeList=cn110 ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=1,node=1
>    Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=257758M MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    Gres=(null) Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/home/sukman/script/test_hostname.sh
>    WorkDir=/home/sukman/script
>    StdErr=/home/sukman/script/slurm-88.out
>    StdIn=/dev/null
>    StdOut=/home/sukman/script/slurm-88.out
>    Power=
>
> $ scontrol show node cn110
> NodeName=cn110 Arch=x86_64 CoresPerSocket=1
>    CPUAlloc=0 CPUErr=0 CPUTot=56 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=cn110 NodeHostName=cn110 Version=17.11
>    OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017
>    RealMemory=257758 AllocMem=0 FreeMem=255742 Sockets=56 Boards=1

This would appear to be wrong - 56 sockets? How did you configure the node in slurm.conf?

FreeMem lower than MinMemoryNode - not sure if that is relevant.

>    State=IDLE ThreadsPerCore=1 TmpDisk=268629 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=defq
>    BootTime=2019-11-14T18:50:56 SlurmdStartTime=2019-11-14T18:53:23
>    CfgTRES=cpu=56,mem=257758M,billing=56
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> ---------------------------------------
> Sukman
> ITB Indonesia
>
> ----- Original Message -----
> From: "Brian Andrus" <toomuc...@gmail.com>
> To: slurm-users@lists.schedmd.com
> Sent: Tuesday, November 12, 2019 10:41:42 AM
> Subject: Re: [slurm-users] Limiting the number of CPU
>
> You are trying to specifically run on node cn110, so you may want to
> check that out with sinfo.
>
> A quick "sinfo -R" can list any down machines and the reasons.
>
> Brian Andrus

-- 
Regards,

Daniel Letai
+972 (0)505 870 456
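A minimal sketch of the same script with the directives uppercased (nothing else changed, untested here). Because the two lowercase "#sbatch" lines are treated as ordinary comments, the job apparently falls back to requesting the node's full memory (MinMemoryNode=257758M, matching cn110's RealMemory), which is what trips the mem=1G part of MaxTRESPU; once --mem is actually parsed the request should fit under the per-user cap:

#!/bin/bash
#SBATCH --job-name=hostname
#SBATCH --time=00:50           # was "#sbatch", so sbatch ignored it
#SBATCH --mem=1M               # was "#sbatch"; when ignored, the whole node's memory is requested
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodelist=cn110

srun hostname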
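On the Sockets=56 question: "slurmd -C" run on cn110 prints the hardware line slurmd detects, which can be pasted into slurm.conf. A hypothetical definition for a 56-core box (the 2 x 28 split is only a guess; use whatever slurmd -C actually reports) might look like:

# Hypothetical topology - verify with `slurmd -C` on cn110 before using
NodeName=cn110 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=257758 State=UNKNOWN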
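For completeness, limits like the ones shown by sacctmgr above are typically adjusted along these lines (sketch only; the values are simply the ones from the output above):

sacctmgr modify qos where name=normal_compute set MaxWall=00:01:00 MaxTRESPerUser=cpu=2,mem=1G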