Are you looking at the man page on the SchedMD website, or on your
computer? If you're looking at the website, those pages are for the
latest version and may not match what you have installed, so this could
be a feature in a later version than 18.08.
--
Prentice
On 8/7/20 11:43 AM, Hanby, Mike wrote:
Dear Slurm Community,
We recognize that SlurmdTimeout has a default value of 300 seconds, and
that if the controller is unable to communicate with a node during that
time it will mark the node down. We have two questions regarding this:
1. Won't individual compute nodes also kill their own jobs if
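For context, a minimal slurm.conf sketch of the parameter in question (300
seconds is the documented default; this is an illustrative excerpt, not the
poster's actual config):

SlurmdTimeout=300   # node is marked DOWN if slurmd does not respond within this window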
I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or
COREs= settings. Currently, they’re:
NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15
and I’ve got 2 jobs currently running on each node that’s available.
So maybe:
NodeName=c0005
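For illustration only, a per-GPU layout with non-overlapping cores could look
something like the lines below (the actual suggestion was cut off above; the
device paths and core ranges here are made up):

NodeName=c0005 Name=gpu File=/dev/nvidia0 COREs=0-9
NodeName=c0005 Name=gpu File=/dev/nvidia1 COREs=10-19
NodeName=c0005 Name=gpu File=/dev/nvidia2 COREs=20-29
NodeName=c0005 Name=gpu File=/dev/nvidia3 COREs=30-39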
Hi Tina,
Thank you so much for looking at this.
slurm 18.08.8
nvidia-smi topo -m
       GPU0   GPU1   GPU2   GPU3   mlx5_0  CPU Affinity
GPU0    X      NV2    NV2    NV2    NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-
This is what's in /var/log/slurmctld
Invalid node state transition requested for node c01 from=DRAINING
to=CANCEL_REBOOT
So it looks like, for version 18.08 at least, you have to first undrain, then
cancel reboot:
scontrol update NodeName="c01" State=undrain Reason="cancelling reboot"
scontrol
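A hedged sketch of the full two-step sequence, assuming the second command is
the CANCEL_REBOOT update form described in the man page (the node name is
illustrative):

scontrol update NodeName="c01" State=undrain Reason="cancelling reboot"
scontrol update NodeName="c01" State=cancel_reboot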
Hi Jodie,
what version of SLURM are you using? I'm pretty sure newer versions pick
the topology up automatically (although I'm on 18.08 so I can't verify
that).
Is what you're wanting to do, basically, to forcefully feed a 'wrong'
gres.conf to make SLURM assume all GPUs are on one CPU? (I don
Howdy, (Slurm 18.08)
We have a bunch of nodes that we've updated with "scontrol reboot ASAP".
We'd like to cancel a few of those. The man page suggests that either of
the following should work; however, both report the same error "
slurm_update error: Invalid node state specified":
sco
Tina,
Thank you. Yes, jobs will run on all 4 gpus if I submit with:
--gres-flags=disable-binding
Yet my goal is to have each gpu bound to a cpu so that a cpu-only job
never runs on that particular cpu (keeping it bound to the gpu and always
free for a gpu job) and give the cpu job t
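For reference, a minimal submission sketch contrasting the two binding modes
(the GRES count and script name are illustrative):

sbatch --gres=gpu:1 --gres-flags=enforce-binding job.sh   # only cores local to the allocated GPU
sbatch --gres=gpu:1 --gres-flags=disable-binding job.sh   # any core, ignoring GRES locality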
Hello,
This is something I've seen once on our systems & it took me a while to
figure out what was going on.
It turned out that the system topology was such that all GPUs were
connected to one CPU. There were no free cores left on that particular
CPU, so SLURM did not schedule any more jobs to
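One quick way to see this situation on a node (the node name is illustrative;
the field names are those printed by scontrol):

scontrol show node c0005 | grep -E 'CPUAlloc|CPUTot|Gres'
nvidia-smi topo -m   # shows which CPU each GPU is attached to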
Good morning.
I am having the same experience here. Wondering if you found a resolution?
Thank you.
Jodie
On Jun 11, 2020, at 3:27 PM, Rhian Resnick <rresn...@fau.edu> wrote:
We have several users submitting single GPU jobs to our cluster. We expected
the jobs to fill each node and fu
My rule of thumb is that the MaxJobs for the entire cluster is twice the
number of cores you have available. That way you have enough jobs
running to fill all the cores and enough jobs pending to refill them.
As for per-user MaxJobs, it just depends on what you think the maximum
number any u
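As a hedged illustration of that rule of thumb via accounting limits (the
account and user names and the 2048 figure are made up, assuming a 1024-core
cluster):

sacctmgr modify account name=root set GrpSubmitJobs=2048   # running + pending cap for the whole tree
sacctmgr modify user name=alice set MaxJobs=64             # per-user cap on running jobs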
Hi All,
I have configured in our test cluster "PrivateData" parameter in
"slurm.conf" as below.
>>
[testuser1@centos7vm01 ~]$ cat /etc/slurm/slurm.conf | less
PrivateData=accounts,jobs,reservations,usage,users,events,partitions,nodes
MCSPlugin=mcs/user
MCSParameters=enforced,select,privatedat
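One hedged way to check the effect, run from a second unprivileged test
account (the second user name is illustrative):

[testuser2@centos7vm01 ~]$ squeue -u testuser1   # other users' jobs should be hidden
[testuser2@centos7vm01 ~]$ sacct -u testuser1    # accounting records should likewise be hidden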