The available features / constraints aren't necessary; their purpose is
to offer a slightly more flexible way to request resources (esp. GPU).
As in, quite often people don't specifically need a P100 or V100, but
they can't run on a Kepler card; with the '--gres=gpu:p100:X' syntax
they can (I b
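(As a quick illustration; the type names here are placeholders for whatever is
actually defined in slurm.conf/gres.conf:

    #SBATCH --gres=gpu:2          # any two GPUs, whatever the type
    #SBATCH --gres=gpu:p100:2     # two GPUs, but only of type p100

The first form lets the scheduler place the job on any GPU node; the second pins
it to a specific card type.)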
OK, after looking at your configs, I noticed that I was missing a
"Gres=gpu" entry on my Nodename definition. Added and distributed...
NodeName=tiger11 NodeAddr=X.X.X.X Sockets=2 CoresPerSocket=12
ThreadsPerCore=2 Gres=gpu:1080gtx:0,gpu:k20:1 Feature=HyperThread
State=UNKNOWN
Assuming that 0 and 1
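(For what it's worth, in slurm.conf the trailing number in a Gres= entry is a
count rather than a device index, so a node with one card of each type would
normally be declared along the lines of

    Gres=gpu:1080gtx:1,gpu:k20:1

with the actual /dev/nvidia* files mapped per type in gres.conf. If 0 and 1 were
meant as device indices, the 1080gtx ends up with a count of zero.)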
Hello,
I don't mind sharing the config at all. Not sure it helps though, it's
pretty basic.
Picking an example node, I have
[ ~]$ scontrol show node arcus-htc-gpu011
NodeName=arcus-htc-gpu011 Arch=x86_64 CoresPerSocket=8
CPUAlloc=16 CPUTot=16 CPULoad=20.43
AvailableFeatures=cpu_gen:Haswell,
Tina, thanks for confirming that GPU GRES resources work with 18.08... I
might just upgrade to 18.08.03 as I am running 18.08.0
The nvidia devices exist on all servers and persistence is set. They have
been in there for a number of years and our users make use of them daily. I
can actually see th
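(A quick sanity check along those lines, assuming the stock NVIDIA tooling:

    ls -l /dev/nvidia*                     # one /dev/nvidiaN per card, plus nvidiactl
    nvidia-smi -q | grep -i persistence

should list the device files and report persistence mode as Enabled.)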
I'm running 18.08.3, and I have a fair number of GPU GRES resources -
recently upgraded to 18.08.03 from a 17.x release. It's definitely not
as if they don't work in an 18.x release. (I do not distribute the same
gres.conf file everywhere though, never tried that.)
Just a really stupid question
Only thing to suggest once again is increasing the logging of both slurmctld and
slurmd.
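(For example, roughly:

    scontrol setdebug debug2        # raises the slurmctld log level on the fly
    scontrol setdebugflags +gres    # logs GRES details in slurmctld.log
    # for slurmd: set SlurmdDebug=debug2 (and DebugFlags=Gres) in slurm.conf
    # and restart, or run "slurmd -D -vvv" in the foreground on one node

Exact option names may differ slightly between releases.)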
As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a db
built with 18.x. I imagine there are enough changes there to cause trouble.
I don't imagine downgrading will fix your issue, if you
Brian, I used a single gres.conf file and distributed it to all nodes...
Restarted all daemons; unfortunately, scontrol still does not show any Gres
resources for the GPU nodes...
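(For reference, once the controller has accepted the GRES, 'scontrol show node'
for a k20 box should include a line like

    Gres=gpu:k20:1

rather than Gres=(null); if it still shows nothing, the node's GRES never
registered with slurmctld.)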
Will try to roll back to a 17.x release. Is it basically a matter of removing
18.x rpms and installing 17's? Does the DB need to
Do one more pass through making sure
s/1080GTX/1080gtx/ and s/K20/k20/
shut down all slurmd and slurmctld, start slurmctld, then start slurmd
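(With the stock systemd units that would be roughly:

    systemctl stop slurmd           # on every compute node
    systemctl stop slurmctld        # on the controller
    systemctl start slurmctld
    systemctl start slurmd          # on every compute node again

assuming slurmd/slurmctld are managed by systemd on your hosts.)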
I find it less confusing to have a global gres.conf file. I haven't used a list
(nvidia[0-1]), mainly because I want to specify the cores to use for each gpu.
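(A global gres.conf in that style might look roughly like this; the node names,
device files and core ranges below are only placeholders:

    # NodeName= scopes each line to the node(s) it applies to
    NodeName=tiger11 Name=gpu Type=k20     File=/dev/nvidia0 Cores=0-11
    NodeName=tiger12 Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=12-23

Older releases spell the core binding CPUs= instead of Cores=.)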
Brian, the specific node does not show any gres...
root@panther02 slurm# scontrol show partition=tiger_1
PartitionName=tiger_1
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
Hidden=NO
Max
Thanks Michael. I will try 17.x as I also could not see anything wrong with
my settings... Will report back afterwards...
Lou
On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico wrote:
> unfortunately, someone smarter than me will have to help further. I'm
> not sure I see anything specifically
As Michael had suggested earlier, debugflags=gres will give you detailed output
of the gres being reported by the nodes. This would be in the slurmctld log.
Or, show us the output of 'scontrol show node=tiger[01-02]' and 'scontrol show
partition=tiger_1'.
From your previous message, that should
unfortunately, someone smarter than me will have to help further. I'm
not sure I see anything specifically wrong. The one thing I might try
is backing the software down to a 17.x release series. I recently
tried 18.x and had some issues. I can't say whether it'll be any
different, but you might
Made the change in the gres.conf file on the local server and restarted slurmd
and slurmctld on the master. Unfortunately, same error...
Distributed the corrected gres.conf to all k20 servers, restarted slurmd and
slurmctld... Still the same error...
On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson wrote:
Is that a lowercase k in k20 specified in the batch script and NodeName, and an
uppercase K specified in gres.conf?
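(Assuming the type string is matched literally, it would have to agree in all
three places, e.g.:

    slurm.conf:    NodeName=tiger11 ... Gres=gpu:k20:1
    gres.conf:     NodeName=tiger11 Name=gpu Type=k20 File=/dev/nvidia0
    batch script:  #SBATCH --gres=gpu:k20:1

so a K20 in one file and a k20 in another could be enough to break the match.)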
On 12/03/2018 09:13 AM, Lou Nicotra wrote:
Hi All, I have recently set up a slurm cluster with my servers and I'm running
into an issue while submitting GPU jobs. It has something t
Here you go... Thanks for looking into this...
lnicotra@tiger11 run# scontrol show config
Configuration data as of 2018-12-03T15:39:51
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = panther02
AccountingStorageLoc = N/A
AccountingStoragePort = 681
are you willing to paste an `scontrol show config` from the machine
having trouble?
On Mon, Dec 3, 2018 at 12:10 PM Lou Nicotra wrote:
>
> I'm running slurmd version 18.08.0...
>
> It seems that the system recognizes the GPUs after a slurmd restart. I tuned
> debug to 5, restarted and then submit
I'm running slurmd version 18.08.0...
It seems that the system recognizes the GPUs after a slurmd restart. I
tuned debug to 5, restarted, and then submitted a job. Nothing gets logged to
the log file on the local server...
[2018-12-03T11:55:18.442] Slurmd shutdown completing
[2018-12-03T11:55:18.484] debug:
do you get anything additional in the slurm logs? have you tried
adding gres to the debugflags? what version of slurm are you running?
On Mon, Dec 3, 2018 at 9:18 AM Lou Nicotra wrote:
>
> Hi All, I have recently set up a slurm cluster with my servers and I'm
> running into an issue while submi
Hi All, I have recently set up a slurm cluster with my servers and I'm
running into an issue while submitting GPU jobs. It has something to do
with gres configurations, but I just can't seem to figure out what is
wrong. Non GPU jobs run fine.
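(The failing submission is presumably of this general shape, with the script body
and program name below as placeholders:

    #!/bin/bash
    #SBATCH --partition=tiger_1
    #SBATCH --gres=gpu:k20:1
    srun ./my_gpu_program           # hypothetical GPU binary

Scripts without the --gres line go through fine, per the above.)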
The error is as follows:
sbatch: error: Batch job submi