On 6/16/20 8:16 am, David Baker wrote:
> We are running Slurm v19.05.5 and I am experimenting with the *scontrol
> reboot* command. I find that compute nodes reboot, but they are not
> returned to service. Rather they remain down following the reboot.
How are you using "scontrol reboot"?
We do:
Hello,
We are running Slurm v19.05.5 and I am experimenting with the scontrol reboot
command. I find that compute nodes reboot, but they are not returned to
service. Rather they remain down following the reboot.
navy55   1  debug*   down   80   2:20:2   1920000   2000  (nu
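For what it's worth, a sketch of one way to combine the reboot with an explicit return to service, assuming the nextstate= option of scontrol (documented for recent Slurm releases) is available in 19.05:

# scontrol reboot ASAP nextstate=RESUME navy55

ASAP drains the node and reboots it as soon as the running jobs finish; nextstate=RESUME marks it to return to service once slurmd re-registers. If a node still comes back down (for example because the reboot exceeded ResumeTimeout), it can be resumed by hand with

# scontrol update NodeName=navy55 State=RESUME

and the ReturnToService setting in slurm.conf also affects whether a re-registering node leaves the DOWN state on its own.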
Diego Zuccato writes:
> On 16/06/20 09:39, Loris Bennett wrote:
>
>>> Maybe it's already known and obvious, but... Remember that a node can be
>>> allocated to only one partition.
>> Maybe I am misunderstanding you, but I think that this is not the case.
>> A node can be in multiple partitions.
Not trying to argue unnecessarily, but what you describe is not a universal
rule, regardless of QOS.
Our GPU nodes are members of 3 GPU-related partitions, 2 more resource-limited
non-GPU partitions, and one of two larger-memory partitions. It’s set up this
way to minimize idle resources (due to …
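For what it's worth, a minimal slurm.conf sketch of that kind of overlap (hypothetical node and partition names) simply lists the same nodes in several PartitionName entries:

NodeName=gpu[01-02] Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 Gres=gpu:8 RealMemory=768000
PartitionName=gpu        Nodes=gpu[01-02] MaxTime=3-00:00:00 State=UP
PartitionName=gpu-short  Nodes=gpu[01-02] MaxTime=04:00:00   State=UP
PartitionName=bigmem     Nodes=gpu[01-02] MaxTime=7-00:00:00 State=UP

Jobs submitted to any of these partitions can then be scheduled onto gpu01 or gpu02.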
If you check the source on GitHub, that's more of a warning, produced when
you didn't specify a CPU count and it's going to calculate it from the
socket/core/thread numbers (src/common/read_config.c):
/* Node boards are factored into sockets */
if ((n->cpus != n-
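As an illustration only (this is not the Slurm source itself), the rule the check enforces, that CPUs must equal Sockets, Sockets*CoresPerSocket, or Sockets*CoresPerSocket*ThreadsPerCore, can be applied to a slurm.conf with a short awk sketch:

awk -F'[= ]+' '/^NodeName=/ {
    cpus = 0; sockets = 1; cores = 1; threads = 1
    for (i = 1; i <= NF; i++) {
        if ($i == "CPUs")           cpus    = $(i + 1)
        if ($i == "Sockets")        sockets = $(i + 1)
        if ($i == "CoresPerSocket") cores   = $(i + 1)
        if ($i == "ThreadsPerCore") threads = $(i + 1)
    }
    if (cpus && cpus != sockets && cpus != sockets * cores &&
        cpus != sockets * cores * threads)
        printf "%s: CPUs=%s matches none of %s, %s, %s\n",
               $2, cpus, sockets, sockets * cores, sockets * cores * threads
}' /etc/slurm/slurm.conf

Any line it prints has a CPUs value that disagrees with the configured topology.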
Thank you. We are planning to put ssh keys on the login nodes only and use the
PAM module to control access to the compute nodes. Will such a setup work? Or
do the ssh keys also need to be on the compute nodes for PAM to work? I'm
sorry, but this is not clearly explained in any documentation.
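For reference, the usual compute-node setup (a sketch; the PAM file name and module options vary by distribution) is pam_slurm_adopt in sshd's account stack, which rejects ssh logins from users with no job running on that node and adopts permitted sessions into the job's cgroup:

# /etc/pam.d/sshd on each compute node
account    required    pam_slurm_adopt.so

The module only handles the authorization step; authentication still works as usual, so the users' public keys have to be readable from the compute nodes as well (typically via a shared home directory), unless host-based authentication is used.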
On 16/06/20 09:39, Loris Bennett wrote:
>> Maybe it's already known and obvious, but... Remember that a node can be
>> allocated to only one partition.
> Maybe I am misunderstanding you, but I think that this is not the case.
> A node can be in multiple partitions.
*Assigned* to multiple partitions …
Hi David,
If I remember right, when you use cgroups, CUDA_VISIBLE_DEVICES always
starts from zero, so it is NOT the physical index of the GPU.
Just verified it:
$> nvidia-smi
Tue Jun 16 13:28:47 2020
[nvidia-smi header table truncated; driver version 440.44]
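A quick way to see this from inside a job (a sketch, assuming task/cgroup with ConstrainDevices=yes in cgroup.conf):

$ srun --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES; nvidia-smi -L'

CUDA_VISIBLE_DEVICES is typically 0 even when a different physical GPU was allocated, because the cgroup hides the other devices; nvidia-smi -L prints the GPU's UUID, which does identify the physical card.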
Today we upgraded the controller node from 19.05 to 20.02.3, and
immediately all Slurm commands (on the controller node) give error
messages for all partitions:
# sinfo --version
sinfo: error: NodeNames=a[001-140] CPUs=1 match no Sockets,
Sockets*CoresPerSocket or Sockets*CoresPerSocket*Thread
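One quick cross-check (a sketch; run on one of the affected nodes) is slurmd's hardware probe, which prints a NodeName line derived from what the node actually has, ready to compare against slurm.conf:

# slurmd -C

If the detected Sockets/CoresPerSocket/ThreadsPerCore don't multiply out to the configured CPUs value (here CPUs=1), the check reported above fires on 20.02; one common fix is to drop CPUs= and let Slurm compute it from the topology, or to make it agree with the product.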
Hi Ahmet,
On 6/16/20 11:27 AM, mercan wrote:
> Did you check the /var/log/messages file for errors? systemd logs to this
> file, instead of the slurmctld log file.
> Ahmet M.
The syslog reports the same errors from slurmctld as are being reported by
every Slurm 20.02 command.
I have found a workaround …
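Since slurmctld is being started by systemd here, the same messages can also be read from the journal directly (a sketch), which is sometimes easier to filter than /var/log/messages:

# journalctl -u slurmctld --since today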
Hi;
Did you check the /var/log/messages file for errors? systemd logs to this
file, instead of the slurmctld log file.
Ahmet M.
On 16.06.2020 11:12, Ole Holm Nielsen wrote:
> Today we upgraded the controller node from 19.05 to 20.02.3, and
> immediately all Slurm commands (on the controller node …
Diego Zuccato writes:
> On 13/06/20 17:47, navin srivastava wrote:
>
>> Yes, we have separate partitions. Some are GPU-specific, having 2 nodes
>> with 8 GPUs each, and other partitions are a mix of both: nodes with 2 GPUs
>> and a few nodes without any GPU.
> Maybe it's already known and obvious, but …