Re: [slurm-users] Nodes do not return to service after scontrol reboot

2020-06-16 Thread Christopher Samuel
On 6/16/20 8:16 am, David Baker wrote: We are running Slurm v19.05.5 and I am experimenting with the *scontrol reboot* command. I find that compute nodes reboot, but they are not returned to service. Rather they remain down following the reboot. How are you using "scontrol reboot"? We do:
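The exact command Chris quotes is cut off in the preview; a minimal sketch of one common invocation, assuming a Slurm release that supports the nextstate= option of scontrol reboot (node name and reason are made up):

    # Reboot node001 once it is idle and have slurmctld return it to service
    # automatically after it comes back up:
    scontrol reboot ASAP nextstate=resume reason="kernel update" node001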

[slurm-users] Nodes do not return to service after scontrol reboot

2020-06-16 Thread David Baker
Hello, We are running Slurm v19.05.5 and I am experimenting with the scontrol reboot command. I find that compute nodes reboot, but they are not returned to service. Rather they remain down following the reboot. navy55 1 debug* down 80 2:20:2 1920000 2000 (nu
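For a node that has already come back up but is still shown as down (navy55 in the preview), one way to clear the state by hand is an scontrol update; a minimal sketch:

    # Check why the node was marked down, then clear the down state once it is healthy:
    sinfo -R -n navy55
    scontrol update NodeName=navy55 State=RESUME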

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-16 Thread Loris Bennett
Diego Zuccato writes: > On 16/06/20 09:39, Loris Bennett wrote: > >>> Maybe it's already known and obvious, but... Remember that a node can be >>> allocated to only one partition. >> Maybe I am misunderstanding you, but I think that this is not the case. >> A node can be in multiple partitio
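As an illustration of Loris's point, a minimal slurm.conf sketch (node and partition names are made up) in which the same nodes are listed in two partitions:

    # gpu[01-02] appear in both a GPU partition and the general batch partition:
    NodeName=gpu[01-02]  CPUs=40 Gres=gpu:2 State=UNKNOWN
    NodeName=node[01-10] CPUs=40 State=UNKNOWN
    PartitionName=gpu   Nodes=gpu[01-02] MaxTime=7-00:00:00 State=UP
    PartitionName=batch Nodes=gpu[01-02],node[01-10] Default=YES MaxTime=7-00:00:00 State=UP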

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-16 Thread Renfro, Michael
Not trying to argue unnecessarily, but what you describe is not a universal rule, regardless of QOS. Our GPU nodes are members of 3 GPU-related partitions, 2 more resource-limited non-GPU partitions, and one of two larger-memory partitions. It’s set up this way to minimize idle resources (due t

Re: [slurm-users] Slurm 20.02.3 error: CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

2020-06-16 Thread Jeffrey T Frey
If you check the source on GitHub, that's more of a warning, produced when you didn't specify a CPU count and Slurm is going to calculate it from the socket/core/thread numbers (src/common/read_config.c): /* Node boards are factored into sockets */ if ((n->cpus != n-
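For reference, a slurm.conf sketch in which the warning should not appear because CPUs agrees with the topology (the values are only an example for a 2-socket, 20-core, 2-thread node; RealMemory is illustrative); alternatively CPUs= can simply be omitted so Slurm derives it:

    # CPUs must equal Sockets * CoresPerSocket * ThreadsPerCore (2*20*2 = 80),
    # or be left out entirely:
    NodeName=a[001-140] Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 CPUs=80 RealMemory=192000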

Re: [slurm-users] [External] Re: ssh-keys on compute nodes?

2020-06-16 Thread Durai Arasan
Thank you. We are planning to put ssh keys on login nodes only and use the PAM module to control access to compute nodes. Will such a setup work? Or, for PAM to work, is it necessary to have the ssh keys on the compute nodes as well? I'm sorry, but this is not clearly mentioned in any documentation.
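Assuming the PAM module in question is pam_slurm_adopt (the preview does not name it), a minimal sketch of the compute-node PAM entry that restricts ssh access to users with a running job on that node:

    # /etc/pam.d/sshd on the compute nodes (sketch; placement relative to the
    # other account lines depends on the local PAM stack):
    account    required    pam_slurm_adopt.so

Note that pam_slurm_adopt runs in the account phase, i.e. after sshd has already authenticated the user, so some authentication path to the compute node (keys, host-based auth, etc.) is still needed.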

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-16 Thread Diego Zuccato
On 16/06/20 09:39, Loris Bennett wrote: >> Maybe it's already known and obvious, but... Remember that a node can be >> allocated to only one partition. > Maybe I am misunderstanding you, but I think that this is not the case. > A node can be in multiple partitions. *Assigned* to multiple part

Re: [slurm-users] How to view GPU indices of the completed jobs?

2020-06-16 Thread Marcus Wagner
Hi David, if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always starts from zero. So this is NOT the index of the GPU. Just verified it: $> nvidia-smi Tue Jun 16 13:28:47 2020 | NVIDIA-SMI 440.44
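For running jobs, the physical indices can be read from the detailed job record rather than from CUDA_VISIBLE_DEVICES; a sketch (the job ID is just an example, and completed jobs are not covered by this):

    # Physical GPU indices allocated to a running job (look for "GRES=gpu(IDX:...)"):
    scontrol show job -d 123456 | grep -i gres
    # Inside the job's cgroup the devices are renumbered from 0, but the UUIDs
    # reported by nvidia-smi still identify the physical GPUs:
    nvidia-smi --query-gpu=index,uuid --format=csv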

[slurm-users] Slurm 20.02.3 error: CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

2020-06-16 Thread Ole Holm Nielsen
Today we upgraded the controller node from 19.05 to 20.02.3, and immediately all Slurm commands (on the controller node) give error messages for all partitions: # sinfo --version sinfo: error: NodeNames=a[001-140] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*Thread

Re: [slurm-users] Slurm 20.02.3 error: CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

2020-06-16 Thread Ole Holm Nielsen
Hi Ahmet, On 6/16/20 11:27 AM, mercan wrote: Did you check the /var/log/messages file for errors? Systemd logs to this file, instead of the slurmctld log file. Ahmet M. The syslog reports the same errors from slurmctld as are being reported by every Slurm 20.02 command. I have found a workaroun
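The preview cuts off before Ole's workaround; independent of it, a quick way to compare the topology slurmd actually detects against the slurm.conf definition is to run, on one of the affected nodes (output values below are only illustrative):

    # Print the hardware configuration as slurmd sees it, in slurm.conf syntax:
    slurmd -C
    # e.g.: NodeName=a001 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 \
    #       ThreadsPerCore=2 RealMemory=191904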

Re: [slurm-users] Slurm 20.02.3 error: CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

2020-06-16 Thread mercan
Hi; Did you check the /var/log/messages file for errors? Systemd logs to this file, instead of the slurmctld log file. Ahmet M. On 16.06.2020 11:12, Ole Holm Nielsen wrote: Today we upgraded the controller node from 19.05 to 20.02.3, and immediately all Slurm commands (on the controller nod
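A sketch of the log checks Ahmet suggests (the path assumes a syslog-based setup; on systemd-only machines the journal is the equivalent):

    # Errors forwarded to syslog:
    grep -i slurmctld /var/log/messages | tail -n 50
    # Or read the unit's log directly from the journal:
    journalctl -u slurmctld --since today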

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-16 Thread Loris Bennett
Diego Zuccato writes: > On 13/06/20 17:47, navin srivastava wrote: > >> Yes we have separate partitions. Some are specific to gpu having 2 nodes >> with 8 gpu and another partitions are mix of both, nodes with 2 gpu and >> very few nodes are without any gpu. > Maybe it's already known and ob