Hi Hugo.
i[3-9] have 2 kinds of cores: the more performant ones with
hyperthreading and the slower ones without.
From
https://www.intel.com/content/www/us/en/products/docs/processors/core/core-14th-gen-desktop-brief.html
:
-8<--
These processors feature performance hybrid architecture, com
JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:gpu03 Dependency=(null)
Color me surprised...
Diego
On 07/12/2024 10:03, Diego Zuccato via slurm-users wrote:
Hi Davide.
On 06/12/2024 16:42, Davide DelVento wrote:
I find it extremely hard to understand situations like this. I wish
'long' (10, IIRC).
Diego
On Fri, Dec 6, 2024 at 7:36 AM Diego Zuccato via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello all.
A user reported that a job wasn't starting, so I tried to replicate the
request and I get:
-8<--
[root@ophfe1 root.old]# scontrol show job 113936
JobId=113936 JobName=test.sh
UserId=root(0) GroupId=root(0) MCS_label=N/A
Priority=1 Nice=0 Account=root QOS=long
JobState=PENDING
IIUC, when you suspend a job it remains in memory but with no CPU time
allocated. If you reboot the node, the job state is lost (unless it uses
checkpointing). When you restarted the jobs, they actually began a new
run (Slurm doesn't know if they use checkpointing or not). You've been
lucky tha
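For reference, a rough sketch of the suspend/resume commands being discussed (the job ID is a placeholder); a suspended job keeps its memory allocation but gets no CPU time, and that in-memory state does not survive a reboot of the node:
-8<--
# Suspend a running job: it stays resident in memory but no CPU time
# is allocated to it while suspended.
scontrol suspend <jobid>

# Resume it later. This only brings back the original run if the node
# was not rebooted in the meantime; otherwise the in-memory state is gone.
scontrol resume <jobid>
-8<--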
And, if it's a device (like a PCIe board), can it be shared between
processes or not?
If it's shareable (like a network interface) you can configure it as a
feature. If it's not, you have to make it a TRES (and possibly configure
cgroups to deny access from jobs that did not request it).
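As a rough sketch of that second option (all names below are hypothetical: the GRES name, node name and device path are made up), the device can be declared as a GRES and cgroup device enforcement enabled so that only jobs which requested it can open it:
-8<--
# slurm.conf (sketch): declare the GRES type and attach it to the node;
# task/cgroup is needed for device enforcement
GresTypes=fpga
TaskPlugin=task/cgroup
NodeName=node01 Gres=fpga:1 ...

# gres.conf on node01: map the GRES to its device file
Name=fpga File=/dev/fpga0

# cgroup.conf: deny access to devices a job did not request
ConstrainDevices=yes
-8<--
A job would then request the device with something like 'sbatch --gres=fpga:1 ...'.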
Diego
Seems the perfect use case for heterogeneous jobs...
Diego
On 31/10/2024 14:18, Davide DelVento via slurm-users wrote:
Another possible use case of this is a regular MPI job where the first/
controller task often uses more memory than the workers and may need to
be scheduled on a higher-memory node.
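A rough sketch of what such a heterogeneous batch job could look like (task counts, memory sizes and the program name are made up); the first component carries the larger allocation for the controller rank, the second one covers the workers:
-8<--
#!/bin/bash
# Component 0: one high-memory task for the controller rank
#SBATCH --ntasks=1 --mem-per-cpu=16G

# Separator between heterogeneous job components
#SBATCH hetjob

# Component 1: the worker ranks, with a smaller memory footprint
#SBATCH --ntasks=63 --mem-per-cpu=2G

# One MPI run spanning both components
srun --het-group=0,1 ./my_mpi_program
-8<--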
Hint: round the RAM reported by 'slurmd -C' down a bit, or you risk the
nodes not coming back up after an upgrade that leaves a bit less free
RAM than configured.
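For example (the numbers are made up), if 'slurmd -C' reports the first line below, the node definition in slurm.conf can use a slightly lower RealMemory, so an upgrade that eats a bit of RAM doesn't leave the node with less memory than it is configured to have:
-8<--
# Reported by 'slurmd -C' on the node (hypothetical values):
#   NodeName=node01 CPUs=64 ... RealMemory=257654
# In slurm.conf, configure a bit less than what was reported:
NodeName=node01 CPUs=64 RealMemory=256000
-8<--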
Diego
On 10/07/2024 17:29, Brian Andrus via slurm-users wrote:
Jack,
To make sure things are set right, run 'slurmd -C' on t
IIUC you can't do that.
You either allow overcommit or you split your job into multiple, smaller
jobs that fit.
The resources you're requesting must be available at the same time: if
your job needs 2 CPUs and you want to run it in parallel, just use a job
array. If you request 500 CPUs it mean
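As an illustration of the job-array suggestion (script name and sizes are hypothetical): each array task gets its own small allocation and is scheduled independently, instead of one huge allocation that has to be free all at once:
-8<--
#!/bin/bash
# 250 independent tasks, each needing only 2 CPUs; Slurm starts a task
# whenever 2 CPUs are free somewhere, no need for 500 CPUs at once.
#SBATCH --array=1-250
#SBATCH --cpus-per-task=2

./my_program "${SLURM_ARRAY_TASK_ID}"
-8<--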
Try adding to the config:
EnforcePartLimits=ANY
JobSubmitPlugins=all_partitions
Diego
On 30/04/2024 15:11, Dietmar Rieder via slurm-users wrote:
Hi Loris,
On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote:
Hi Dietmar,
Dietmar Rieder via slurm-users writes:
Hi,
is it possible t
On 06/03/2024 13:49, Gestió Servidors via slurm-users wrote:
And how can I reject the job inside the lua script?
Just use
return slurm.FAILURE
and the job will be refused.
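For context, a minimal job_submit.lua sketch (the time-limit check itself is just a made-up example) showing where that return value goes; returning slurm.FAILURE from slurm_job_submit() makes the submission be refused:
-8<--
-- job_submit.lua (sketch; the rule below is only an example)
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.time_limit == slurm.NO_VAL then
        slurm.log_user("Please set a time limit for your job")
        return slurm.FAILURE   -- the job is refused at submission
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
-8<--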
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
I don't know why that happens (other than you're opening a comment and
not closing it, IIUC), but it would probably be less surprising to just
reject the submission than to reduce the limit.
In the (rare...) case the user actually needs all the time requested,
you risk wasting resources. If you rej