Mahmood,

What do you see as the problem here? To me, there is no problem and the scheduler is working exactly as it should. The reason "Resources" means that there are not enough computing resources available for your job to run right now, so the job is sitting in the queue in the pending state, waiting for the necessary resources to become available. This is exactly what schedulers are supposed to do.

As Andreas pointed out, looking at the output of 'scontrol show node compute-0-0' that you provided, compute-0-0 has 32 cores and about 63 GB of RAM. Of those, 9 cores and 55 GB of RAM have already been allocated, leaving 23 cores but only about 8 GB of RAM available for other jobs. The job you submitted requested 20 cores (tasks, technically) and 40 GB of RAM. Since compute-0-0 doesn't have enough RAM available, Slurm is keeping your job in the queue until enough RAM is available for it to run. This is exactly what Slurm should be doing.
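For what it's worth, you can sanity-check those numbers straight from the scontrol output below. RealMemory and AllocMem are reported in MB, so the memory still available for allocation is just their difference (the grep here is only one way to pull the fields out):

$ scontrol show node compute-0-0 | grep -oE '(RealMemory|AllocMem)=[0-9]+'
RealMemory=64261
AllocMem=56320
$ echo $((64261 - 56320))   # MB still available to allocate, roughly 7.8 GB
7941

That 7941 MB is well short of the 40 GB your job requests with --mem=40GB, so the job has to wait.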

Prentice

On 4/17/19 11:00 AM, Henkel, Andreas wrote:
I think there isn’t enough memory.
AllocTRES shows mem=55G,
and your job wants another 40G, while the node only has 63G in total.
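You can pull both figures out of the scontrol output you posted in one go, e.g.:

$ scontrol show node compute-0-0 | grep -E 'RealMemory|AllocTRES'
   RealMemory=64261 AllocMem=56320 FreeMem=37715 Sockets=32 Boards=1
   AllocTRES=cpu=9,mem=55G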
Best,
Andreas

On 17.04.2019 at 16:45, Mahmood Naderan <mahmood...@gmail.com> wrote:

Hi,
Although it was fine for previous job runs, the following script is now stuck as PD (pending) with "Resources" given as the reason.

$ cat slurm_script.sh
#!/bin/bash
#SBATCH --output=test.out
#SBATCH --job-name=g09-test
#SBATCH --ntasks=20
#SBATCH --nodelist=compute-0-0
#SBATCH --mem=40GB
#SBATCH --account=z7
#SBATCH --partition=EMERALD
g09 test.gjf
$ sbatch slurm_script.sh
Submitted batch job 878
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               878   EMERALD g09-test shakerza PD       0:00      1 (Resources)



However, all things look good.

$ sacctmgr list association format=user,account,partition,grptres%20 | grep shaker
shakerzad+      local
shakerzad+         z7    emerald cpu=20,mem=40G
$ scontrol show node compute-0-0
NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=9 CPUTot=32 CPULoad=8.89
   AvailableFeatures=rack-0,32CPUs
   ActiveFeatures=rack-0,32CPUs
   Gres=(null)
   NodeAddr=10.1.1.254 NodeHostName=compute-0-0 Version=18.08
   OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
   RealMemory=64261 AllocMem=56320 FreeMem=37715 Sockets=32 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=444124 Weight=20511900 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,EMERALD,QUARTZ
   BootTime=2019-04-06T10:03:47 SlurmdStartTime=2019-04-06T10:05:54
   CfgTRES=cpu=32,mem=64261M,billing=47
   AllocTRES=cpu=9,mem=55G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


Any idea?

Regards,
Mahmood

