Hi all, thank you for the comments and input.
Yes, it is true, the letter case was indeed one of the main problems. After correcting the directives to uppercase #SBATCH, the job no longer gets stuck. However, as Daniel noticed, there is a memory problem: running the same script, the job now passes the QOS limit, but it is killed for going over its memory limit. Below is the job output:

slurmstepd: error: Job 90 exceeded memory limit (1188 > 1024), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 90 ON cn110 CANCELLED AT 2019-11-15T18:45:23 ***

Attached is my slurm.conf. It seems there is no memory configuration in it, yet I still run into this problem. Would anyone mind giving a comment or suggestion? (One possible adjustment is sketched after the quoted thread below.)

Additionally, the following is the limit setting for user sukman:

# sacctmgr show association where user=sukman format=user,grpTRES,grpwall,grptresmins,maxjobs,maxtres,maxtrespernode,maxwall,qos,defaultqos
      User       GrpTRES     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode     MaxWall                  QOS   Def QOS
---------- ------------- ----------- ------------- ------- ------------- -------------- ----------- -------------------- ---------
    sukman                                                                                                 normal_compute

Thanks.

------------------------------------------
Suksmandhira H
ITB Indonesia

----- Original Message -----
From: "Daniel Letai" <d...@letai.org.il>
To: slurm-users@lists.schedmd.com
Sent: Thursday, November 14, 2019 10:51:10 PM
Subject: Re: [slurm-users] Limiting the number of CPU

3 possible issues, inline below

On 14/11/2019 14:58:29, Sukman wrote:

Hi Brian,

thank you for the suggestion. It appears that my node was in a drain state. I rebooted the node and everything became fine.

However, the QOS still cannot be applied properly. Do you have any opinion regarding this issue?

$ sacctmgr show qos where Name=normal_compute format=Name,Priority,MaxWal,MaxTRESPU
      Name   Priority     MaxWall     MaxTRESPU
---------- ---------- ----------- -------------
normal_co+         10    00:01:00  cpu=2,mem=1G

When I run the following script:

#!/bin/bash
#SBATCH --job-name=hostname
#sbatch --time=00:50
#sbatch --mem=1M

I believe those should be uppercase #SBATCH

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodelist=cn110

srun hostname

it turns out that the QOSMaxMemoryPerUser limit has been hit:

$ squeue
 JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    88      defq hostname   sukman PD   0:00      1 (QOSMaxMemoryPerUser)

$ scontrol show job 88
JobId=88 JobName=hostname
   UserId=sukman(1000) GroupId=nobody(1000) MCS_label=N/A
   Priority=4294901753 Nice=0 Account=user QOS=normal_compute
   JobState=PENDING Reason=QOSMaxMemoryPerUser Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-11-14T19:49:37 EligibleTime=2019-11-14T19:49:37
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-11-14T19:55:50
   Partition=defq AllocNode:Sid=itbhn02:51072
   ReqNodeList=cn110 ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=257758M MinTmpDiskNode=0

MinMemoryNode seems to require more than FreeMem in the node below

   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/sukman/script/test_hostname.sh
   WorkDir=/home/sukman/script
   StdErr=/home/sukman/script/slurm-88.out
   StdIn=/dev/null
   StdOut=/home/sukman/script/slurm-88.out
   Power=
$ scontrol show node cn110
NodeName=cn110 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=56 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=cn110 NodeHostName=cn110 Version=17.11
   OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017
   RealMemory=257758 AllocMem=0 FreeMem=255742 Sockets=56 Boards=1

This would appear to be wrong - 56 sockets? How did you configure the node in slurm.conf?
FreeMem lower than MinMemoryNode - not sure if that is relevant.

   State=IDLE ThreadsPerCore=1 TmpDisk=268629 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2019-11-14T18:50:56 SlurmdStartTime=2019-11-14T18:53:23
   CfgTRES=cpu=56,mem=257758M,billing=56
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

---------------------------------------
Sukman
ITB Indonesia

----- Original Message -----
From: "Brian Andrus" <toomuc...@gmail.com>
To: slurm-users@lists.schedmd.com
Sent: Tuesday, November 12, 2019 10:41:42 AM
Subject: Re: [slurm-users] Limiting the number of CPU

You are trying to specifically run on node cn110, so you may want to check that out with sinfo.
A quick "sinfo -R" can list any down machines and the reasons.

Brian Andrus

--
Regards,

Daniel Letai
+972 (0)505 870 456
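For reference, a minimal sketch of two ways the memory problem could be handled. The 900M and 2G values are only illustrative, the right numbers depend on what the job actually needs, and the sacctmgr change requires Slurm administrator rights:

# Option 1: request memory explicitly in the batch script, so the job's
# limit is set up front and stays inside the QOS cap (MaxTRESPU mem=1G)
#SBATCH --mem=900M

# Option 2: raise the per-user memory cap on the QOS itself
sacctmgr modify qos normal_compute set MaxTRESPerUser=cpu=2,mem=2G

Note that the step was killed at 1188 MB against a 1024 MB limit, so with the cap left at 1G the job's memory footprint itself would also have to come down below 1 GB.

(The attached slurm.conf follows.)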
ReturnToService=2
TaskPlugin=task/cgroup
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
# ACCOUNTING
# Limit Enforcement
AccountingStorageEnforce=qos,limits
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# CONSUMABLE RESOURCES
#
#SelectType=select/linear
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Scheduler
SchedulerType=sched/backfill
# Nodes
NodeName=cn[100-113,115-128] Procs=56
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=cn[100-113,115-128]
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=0
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
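On the slurm.conf above, two things stand out in line with Daniel's comments. First, the nodes are declared only as Procs=56, which matches the reported 56 sockets with 1 core each that Daniel questioned. Second, with SelectTypeParameters=CR_CPU_Memory and neither a --mem request nor a configured default, a job asks for the whole node's memory (hence MinMemoryNode=257758M in the pending-job output), which is far above the QOS cap of mem=1G and explains the QOSMaxMemoryPerUser reason. A rough sketch of what the relevant lines might look like is below; the 2 x 28 topology is a guess (the exact line can be taken from `slurmd -C` on cn110), and DefMemPerCPU=4096 is only an illustrative default:

# Nodes: describe the real topology and memory rather than only Procs
# (2 sockets x 28 cores is an assumption; verify with `slurmd -C` on the node)
NodeName=cn[100-113,115-128] Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=257758

# Give jobs that do not request memory a default well below the whole node
# (illustrative value, in MB per allocated CPU)
DefMemPerCPU=4096

If the explicit node definition is meant to be authoritative, FastSchedule is typically set to 1 rather than the current 0, which bases scheduling on what slurmd reports.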