Please include the output of:
scontrol show node=liqidos-dean-node1
scontrol show partition=Partition_you_are_attempting_to_submit_to
as well as any other #SBATCH lines submitted with the failing job.
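If it helps, the Gres-related fields can be pulled straight out of that, roughly like this (the grep pattern is just a convenience, and the script name is a placeholder):

    # Gres= and CfgTRES= lines as the controller sees them
    scontrol show node=liqidos-dean-node1 | grep -E 'Gres|TRES'
    # all #SBATCH directives from the failing job script
    grep '^#SBATCH' your_job_script.sh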
On 2/4/20 9:42 AM, dean.w.schu...@gmail.com wrote:
I've already restarted slurmctld and slurmd on all nodes. Still get the same
problem.
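Roughly (assuming a systemd-managed install) the restarts were along these lines:

    # on the controller host
    sudo systemctl restart slurmctld
    # on every compute node
    sudo systemctl restart slurmd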
-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcus
Wagner
Sent: Tuesday, February 4, 2020 2:31 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] sbatch script won't accept --gres that requires more
than 1 gpu
Hi Dean,
could you please try restarting the slurmctld?
That usually helps at our site.
I have never seen this happen with gres, but I have seen it plenty of other times.
That is why we restart slurmctld once a day via a cron job.
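A minimal sketch of that kind of cron entry, assuming a systemd-managed install (the file path and the 03:00 time are just examples, not our exact setup):

    # /etc/cron.d/slurmctld-restart -- restart the controller daily at 03:00
    0 3 * * * root /usr/bin/systemctl restart slurmctld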
Best
Marcus
On 2/4/20 12:59 AM, Dean Schulze wrote:
When I run an sbatch script with the line
#SBATCH --gres=gpu:gp100:1
it runs. When I change it to
#SBATCH --gres=gpu:gp100:3
it fails with "Requested node configuration is not available". But I
have a node with 4 gp100s available. Here's my slurm.conf:
NodeName=liqidos-dean-node1 CPUs=2 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3770 Gres=gpu:gp100:4
That node has a gres.conf with these lines:
Name=gpu Type=gp100 File=/dev/nvidia0
Name=gpu Type=gp100 File=/dev/nvidia1
Name=gpu Type=gp100 File=/dev/nvidia2
Name=gpu Type=gp100 File=/dev/nvidia3
The character devices all exist in /dev.
What's the controller complaining about?
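For reference, the failing submission is essentially this minimal script (the job name and the nvidia-smi payload are placeholders, not the real job):

    #!/bin/bash
    #SBATCH --job-name=gres-test
    #SBATCH --gres=gpu:gp100:3
    # the same script runs fine with --gres=gpu:gp100:1
    nvidia-smi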
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de