Hi all, thank you for the comments and input.
Yes, it is true, the letter case was indeed one of the main problems. After correcting the directives to uppercase #SBATCH, the job no longer gets stuck. However, as Daniel noticed, there is a memory problem: running the same script, the job now passes the QOS limit, but it is killed for going over its memory limit. Below is the job output:

slurmstepd: error: Job 90 exceeded memory limit (1188 > 1024), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 90 ON cn110 CANCELLED AT 2019-11-15T18:45:23 ***

Attached is my slurm.conf. It seems there is no memory configuration in it, yet I still run into this problem. Would anyone mind giving a comment or suggestion? (One possible adjustment is sketched after the quoted thread below.)

Additionally, the following is the limit setting for user sukman:

# sacctmgr show association where user=sukman format=user,grpTRES,grpwall,grptresmins,maxjobs,maxtres,maxtrespernode,maxwall,qos,defaultqos
      User       GrpTRES     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode     MaxWall                  QOS   Def QOS
---------- ------------- ----------- ------------- ------- ------------- -------------- ----------- -------------------- ---------
    sukman                                                                                                 normal_compute

Thanks.

------------------------------------------
Suksmandhira H
ITB Indonesia

----- Original Message -----
From: "Daniel Letai" <d...@letai.org.il>
To: slurm-users@lists.schedmd.com
Sent: Thursday, November 14, 2019 10:51:10 PM
Subject: Re: [slurm-users] Limiting the number of CPU

3 possible issues, inline below

On 14/11/2019 14:58:29, Sukman wrote:

Hi Brian,

thank you for the suggestion. It appears that my node was in a drain state. I rebooted the node and everything became fine.

However, the QOS still cannot be applied properly. Do you have any opinion regarding this issue?

$ sacctmgr show qos where Name=normal_compute format=Name,Priority,MaxWal,MaxTRESPU
      Name   Priority     MaxWall     MaxTRESPU
---------- ---------- ----------- -------------
normal_co+         10    00:01:00  cpu=2,mem=1G

When I run the following script:

#!/bin/bash
#SBATCH --job-name=hostname
#sbatch --time=00:50
#sbatch --mem=1M

I believe those should be uppercase #SBATCH

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodelist=cn110

srun hostname

it turns out that the QOSMaxMemoryPerUser limit has been hit:

$ squeue
 JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    88      defq hostname   sukman PD   0:00      1 (QOSMaxMemoryPerUser)

$ scontrol show job 88
JobId=88 JobName=hostname
   UserId=sukman(1000) GroupId=nobody(1000) MCS_label=N/A
   Priority=4294901753 Nice=0 Account=user QOS=normal_compute
   JobState=PENDING Reason=QOSMaxMemoryPerUser Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-11-14T19:49:37 EligibleTime=2019-11-14T19:49:37
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-11-14T19:55:50
   Partition=defq AllocNode:Sid=itbhn02:51072
   ReqNodeList=cn110 ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=257758M MinTmpDiskNode=0

MinMemoryNode seems to require more than FreeMem in the node below

   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/sukman/script/test_hostname.sh
   WorkDir=/home/sukman/script
   StdErr=/home/sukman/script/slurm-88.out
   StdIn=/dev/null
   StdOut=/home/sukman/script/slurm-88.out
   Power=
$ scontrol show node cn110
NodeName=cn110 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=56 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=cn110 NodeHostName=cn110 Version=17.11
   OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017
   RealMemory=257758 AllocMem=0 FreeMem=255742 Sockets=56 Boards=1

This would appear to be wrong - 56 sockets? How did you configure the node in slurm.conf?
FreeMem lower than MinMemoryNode - not sure if that is relevant.

   State=IDLE ThreadsPerCore=1 TmpDisk=268629 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2019-11-14T18:50:56 SlurmdStartTime=2019-11-14T18:53:23
   CfgTRES=cpu=56,mem=257758M,billing=56
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

---------------------------------------
Sukman
ITB Indonesia

----- Original Message -----
From: "Brian Andrus" <toomuc...@gmail.com>
To: slurm-users@lists.schedmd.com
Sent: Tuesday, November 12, 2019 10:41:42 AM
Subject: Re: [slurm-users] Limiting the number of CPU

You are trying to specifically run on node cn110, so you may want to check that out with sinfo.
A quick "sinfo -R" can list any down machines and the reasons.

Brian Andrus

--
Regards,

Daniel Letai
+972 (0)505 870 456
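For reference, a minimal sketch of two ways the memory problem could be handled. The 900M and 2G values are only illustrative, the right numbers depend on what the job actually needs, and the sacctmgr change requires Slurm administrator rights:

# Option 1: request memory explicitly in the batch script, so the job's
# limit is set up front and stays inside the QOS cap (MaxTRESPU mem=1G)
#SBATCH --mem=900M

# Option 2: raise the per-user memory cap on the QOS itself
sacctmgr modify qos normal_compute set MaxTRESPerUser=cpu=2,mem=2G

Note that the step was killed at 1188 MB against a 1024 MB limit, so with the cap left at 1G the job's memory footprint itself would also have to come down below 1 GB.

(The attached slurm.conf follows.)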
ReturnToService=2
TaskPlugin=task/cgroup
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
# ACCOUNTING
# Limit Enforcement
AccountingStorageEnforce=qos,limits
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# CONSUMABLE RESOURCES
#
#SelectType=select/linear
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Scheduler
SchedulerType=sched/backfill
# Nodes
NodeName=cn[100-113,115-128] Procs=56
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=cn[100-113,115-128]
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=0
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
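On the slurm.conf above, two things stand out in line with Daniel's comments. First, the nodes are declared only as Procs=56, which matches the reported 56 sockets with 1 core each that Daniel questioned. Second, with SelectTypeParameters=CR_CPU_Memory and neither a --mem request nor a configured default, a job asks for the whole node's memory (hence MinMemoryNode=257758M in the pending-job output), which is far above the QOS cap of mem=1G and explains the QOSMaxMemoryPerUser reason. A rough sketch of what the relevant lines might look like is below; the 2 x 28 topology is a guess (the exact line can be taken from `slurmd -C` on cn110), and DefMemPerCPU=4096 is only an illustrative default:

# Nodes: describe the real topology and memory rather than only Procs
# (2 sockets x 28 cores is an assumption; verify with `slurmd -C` on the node)
NodeName=cn[100-113,115-128] Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=257758

# Give jobs that do not request memory a default well below the whole node
# (illustrative value, in MB per allocated CPU)
DefMemPerCPU=4096

If the explicit node definition is meant to be authoritative, FastSchedule is typically set to 1 rather than the current 0, which bases scheduling on what slurmd reports.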