Please include the output of:
scontrol show node=liqidos-dean-node1
scontrol show partition=Partition_you_are_attempting_to_submit_to
as well as any other #SBATCH lines submitted with the failing job.
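If it helps, the Gres-related fields can be pulled straight out of that, roughly like this (the grep pattern is just a convenience, and the script name is a placeholder):

    # Gres= and CfgTRES= lines as the controller sees them
    scontrol show node=liqidos-dean-node1 | grep -E 'Gres|TRES'
    # all #SBATCH directives from the failing job script
    grep '^#SBATCH' your_job_script.sh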
On 2/4/20 9:42 AM, dean.w.schu...@gmail.com wrote:
I've already restarted slurmctld and slurmd on all nodes. Still get the same
problem.
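Roughly (assuming a systemd-managed install) the restarts were along these lines:

    # on the controller host
    sudo systemctl restart slurmctld
    # on every compute node
    sudo systemctl restart slurmd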
-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcus
Wagner
Sent: Tuesday, February 4, 2020 2:31 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] sbatch script won't accept --gres that requires more
than 1 gpu
Hi Dean,
could you please try restarting the slurmctld?
That usually helps at our site.
I have never seen this happen with gres, but I have seen it plenty of other times.
That is why we restart slurmctld once a day via a cron job.
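A minimal sketch of that kind of cron entry, assuming a systemd-managed install (the file path and the 03:00 time are just examples, not our exact setup):

    # /etc/cron.d/slurmctld-restart -- restart the controller daily at 03:00
    0 3 * * * root /usr/bin/systemctl restart slurmctld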
Best
Marcus
On 2/4/20 12:59 AM, Dean Schulze wrote:
When I run an sbatch script with the line
#SBATCH --gres=gpu:gp100:1
it runs. When I change it to
#SBATCH --gres=gpu:gp100:3
it fails with "Requested node configuration is not available". But I
have a node with 4 gp100s available. Here's my slurm.conf:
NodeName=liqidos-dean-node1 CPUs=2 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3770 Gres=gpu:gp100:4
That node has a gres.conf with these lines:
Name=gpu Type=gp100 File=/dev/nvidia0
Name=gpu Type=gp100 File=/dev/nvidia1
Name=gpu Type=gp100 File=/dev/nvidia2
Name=gpu Type=gp100 File=/dev/nvidia3
The character devices all exist in /dev.
What's the controller complaining about?
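For reference, the failing submission is essentially this minimal script (the job name and the nvidia-smi payload are placeholders, not the real job):

    #!/bin/bash
    #SBATCH --job-name=gres-test
    #SBATCH --gres=gpu:gp100:3
    # the same script runs fine with --gres=gpu:gp100:1
    nvidia-smi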
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de