3 possible issues, inline below
On 14/11/2019 14:58:29, Sukman wrote:
> I believe those should be uppercase #SBATCH
>
> Hi Brian, thank you for the suggestion.
>
> It appears that my node is in drain state. I rebooted the node and
> everything became fine. However, the QOS still cannot be applied
> properly. Do you have any opinion regarding this issue?
>
> $ sacctmgr show qos where Name=normal_compute format=Name,Priority,MaxWal,MaxTRESPU
>       Name   Priority     MaxWall     MaxTRESPU
> ---------- ---------- ----------- -------------
> normal_co+         10    00:01:00  cpu=2,mem=1G
>
> when I run the following script:
>
> #!/bin/bash
> #SBATCH --job-name=hostname
> #sbatch --time=00:50
> #sbatch --mem=1M

MinMemoryNode seems to require more than FreeMem in Node below

> #SBATCH --nodes=1
> #SBATCH --ntasks=1
> #SBATCH --ntasks-per-node=1
> #SBATCH --cpus-per-task=1
> #SBATCH --nodelist=cn110
>
> srun hostname
>
> It turns out that the QOSMaxMemoryPerUser has been met:
>
> $ squeue
>    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>       88      defq hostname   sukman PD       0:00      1 (QOSMaxMemoryPerUser)
>
> $ scontrol show job 88
> JobId=88 JobName=hostname
>    UserId=sukman(1000) GroupId=nobody(1000) MCS_label=N/A
>    Priority=4294901753 Nice=0 Account=user QOS=normal_compute
>    JobState=PENDING Reason=QOSMaxMemoryPerUser Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
>    SubmitTime=2019-11-14T19:49:37 EligibleTime=2019-11-14T19:49:37
>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-11-14T19:55:50
>    Partition=defq AllocNode:Sid=itbhn02:51072
>    ReqNodeList=cn110 ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=1,node=1
>    Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=257758M MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    Gres=(null) Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/home/sukman/script/test_hostname.sh
>    WorkDir=/home/sukman/script
>    StdErr=/home/sukman/script/slurm-88.out
>    StdIn=/dev/null
>    StdOut=/home/sukman/script/slurm-88.out
>    Power=
>
> $ scontrol show node cn110
> NodeName=cn110 Arch=x86_64 CoresPerSocket=1
>    CPUAlloc=0 CPUErr=0 CPUTot=56 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=cn110 NodeHostName=cn110 Version=17.11
>    OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017
>    RealMemory=257758 AllocMem=0 FreeMem=255742 Sockets=56 Boards=1

This would appear to be wrong - 56 sockets? How did you configure the node in slurm.conf?

FreeMem lower than MinMemoryNode - not sure if that is relevant.

>    State=IDLE ThreadsPerCore=1 TmpDisk=268629 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=defq
>    BootTime=2019-11-14T18:50:56 SlurmdStartTime=2019-11-14T18:53:23
>    CfgTRES=cpu=56,mem=257758M,billing=56
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> ---------------------------------------
> Sukman
> ITB Indonesia
>
> ----- Original Message -----
> From: "Brian Andrus" <toomuc...@gmail.com>
> To: slurm-users@lists.schedmd.com
> Sent: Tuesday, November 12, 2019 10:41:42 AM
> Subject: Re: [slurm-users] Limiting the number of CPU
>
> You are trying to specifically run on node cn110, so you may want to
> check that out with sinfo.
>
> A quick "sinfo -R" can list any down machines and the reasons.
>
> Brian Andrus

-- 
Regards,

Daniel Letai
+972 (0)505 870 456
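A minimal sketch of the same script with the directives uppercased (nothing else changed, untested here). Because the two lowercase "#sbatch" lines are treated as ordinary comments, the job apparently falls back to requesting the node's full memory (MinMemoryNode=257758M, matching cn110's RealMemory), which is what trips the mem=1G part of MaxTRESPU; once --mem is actually parsed the request should fit under the per-user cap:

#!/bin/bash
#SBATCH --job-name=hostname
#SBATCH --time=00:50           # was "#sbatch", so sbatch ignored it
#SBATCH --mem=1M               # was "#sbatch"; when ignored, the whole node's memory is requested
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodelist=cn110

srun hostname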
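On the Sockets=56 question: "slurmd -C" run on cn110 prints the hardware line slurmd detects, which can be pasted into slurm.conf. A hypothetical definition for a 56-core box (the 2 x 28 split is only a guess; use whatever slurmd -C actually reports) might look like:

# Hypothetical topology - verify with `slurmd -C` on cn110 before using
NodeName=cn110 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=257758 State=UNKNOWN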
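For completeness, limits like the ones shown by sacctmgr above are typically adjusted along these lines (sketch only; the values are simply the ones from the output above):

sacctmgr modify qos where name=normal_compute set MaxWall=00:01:00 MaxTRESPerUser=cpu=2,mem=1G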