Re: [slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch

2021-05-20 Thread Christopher Samuel
On 5/19/21 1:41 pm, Tim Carlson wrote: but I still don't understand how with "shared=exclusive" srun gives one result and sbatch gives another. I can't either, but I can reproduce it with Slurm 20.11.7. :-/ -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread Christopher Samuel
On 5/19/21 9:15 pm, Herc Silverstein wrote: Does anyone have an idea of what might be going on? To add to the other suggestions, I would say that checking the slurmctld and slurmd logs to see what it is saying is wrong is a good place to start. Best of luck, Chris -- Chris Samuel : http

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread Tim Carlson
The SLURM controller AND all the compute nodes need to know who all is in the cluster. If you want to add a node or it changes IP addresses, you need to let all the nodes know about this which, for me, usually means restarting slurmd on the compute nodes. I just say this because I get caught by th

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread Brian Andrus
Does it tell you the reason for it being down? Check with "sinfo -R". I have seen cases where a node comes up, but the amount of memory slurmd sees is a little less than what was configured in slurm.conf. You should always set aside some of the memory when defining it in slurm.conf so you have memory for the oper
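
A minimal sketch of the memory advice above, assuming you pick the reserve size yourself. The function name suggest_real_memory and the example figures are hypothetical and not from the thread. It only does the subtraction: detected memory minus a reserve for the operating system gives a candidate RealMemory value for slurm.conf.

```shell
#!/bin/sh
# suggest_real_memory TOTAL_MB RESERVE_MB
# Hypothetical helper: prints a RealMemory value (in MB) that leaves
# RESERVE_MB of the node's memory outside Slurm's control.
suggest_real_memory() {
    total_mb=$1
    reserve_mb=$2
    # Refuse a reserve that would leave nothing for jobs.
    if [ "$reserve_mb" -ge "$total_mb" ]; then
        echo "reserve must be smaller than total" >&2
        return 1
    fi
    echo $((total_mb - reserve_mb))
}

# Typical use on a node (these commands need a Slurm installation):
#   sinfo -R      # list down/drained nodes with the recorded reason
#   slurmd -C     # print the hardware slurmd detects, including RealMemory
#   suggest_real_memory 64000 2048
```

Comparing the RealMemory reported by "slurmd -C" with the value in slurm.conf is one way to spot the mismatch described above; how much to reserve is a site decision.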

Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-05-20 Thread Bas van der Vlies
I know, but see the script: we only do this for uid > 1000. On 20/05/2021 17:29, Timo Rothenpieler wrote: You shouldn't need this script and pam_exec. You can set those limits directly in the systemd config to match every user. On 20.05.2021 16:28, Bas van der Vlies wrote: same here we use the sys

Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-05-20 Thread Timo Rothenpieler
You shouldn't need this script and pam_exec. You can set those limits directly in the systemd config to match every user. On 20.05.2021 16:28, Bas van der Vlies wrote: same here, we use the systemd user slice in our PAM stack: ``` # Setup for local and ldap logins session required pam_systemd.
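
One way the "systemd config to match every user" suggestion could look, as a sketch under assumptions: systemd 239 or newer (which supports user-.slice.d drop-ins) and a cgroup v2 host for MemoryMax. The function name render_user_slice_dropin, the file name 50-limits.conf, and the limit values are hypothetical and not taken from the thread.

```shell
#!/bin/sh
# render_user_slice_dropin CPU_QUOTA MEMORY_MAX
# Hypothetical helper: prints a [Slice] drop-in that, when installed under
# /etc/systemd/system/user-.slice.d/, applies to every user-UID.slice.
render_user_slice_dropin() {
    printf '[Slice]\nCPUQuota=%s\nMemoryMax=%s\n' "$1" "$2"
}

# Example (as root; the values are placeholders, not recommendations):
#   mkdir -p /etc/systemd/system/user-.slice.d
#   render_user_slice_dropin 200% 8G \
#       > /etc/systemd/system/user-.slice.d/50-limits.conf
#   systemctl daemon-reload
```

Because the drop-in matches every user slice, it does not reproduce the "uid > 1000" exception from the pam_exec approach on its own.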

Re: [slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

2021-05-20 Thread Cristóbal Navarro
Hi Community, just wanted to share that this problem got solved with the help of the pyxis developers: https://github.com/NVIDIA/pyxis/issues/47 The solution was to add ConstrainDevices=yes, as it was missing in the cgroup.conf file. On Thu, May 13, 2021 at 5:14 PM Cristóbal Navarro < cristobal.navarr.
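
A small sketch for checking whether a cgroup.conf already carries the setting named above. The function name has_constrain_devices is hypothetical, and the path /etc/slurm/cgroup.conf in the usage comment is an assumption; the file's location depends on how Slurm was installed.

```shell
#!/bin/sh
# has_constrain_devices FILE
# Hypothetical helper: succeeds if FILE has an uncommented
# "ConstrainDevices=yes" line (matched case-insensitively).
has_constrain_devices() {
    grep -Eiq '^[[:space:]]*ConstrainDevices[[:space:]]*=[[:space:]]*yes' "$1"
}

# Example (path is an assumption):
#   has_constrain_devices /etc/slurm/cgroup.conf \
#       || echo "ConstrainDevices=yes is missing"
```

After changing cgroup.conf, the Slurm daemons need to pick up the new configuration before the setting takes effect.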

Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-05-20 Thread Bas van der Vlies
same here, we use the systemd user slice in our PAM stack: ``` # Setup for local and ldap logins session required pam_systemd.so session required pam_exec.so seteuid type=open_session /etc/security/limits.sh ``` limits.sh: ``` #!/bin/sh -e PAM_UID=$(getent passwd "${PAM_USER}" | cut -d: -f3

Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-05-20 Thread mercan
Hi; We use a bash script to watch and kill users' processes if they exceed our CPU and memory limits. This solution also ensures that total CPU or memory usage cannot exceed the limits, whether because of many well-behaved users or a single bad user: https://github.com/mercanca/kill_for_loginnode.sh A

Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-05-20 Thread Timo Rothenpieler
On 24.04.2021 04:37, Cristóbal Navarro wrote: Hi Community, I have a set of users still not so familiar with slurm, and yesterday they bypassed srun/sbatch and just ran their CPU program directly on the head/login node thinking it would still run on the compute node. I am aware that I will nee

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread bbenedetto
We had a situation recently where a desktop was turned off for a week. When we brought it back online (in a different part of the network with a different IP), everything came up fine (slurmd and munge). But it kept going into DOWN* for no apparent reason (neither daemon-wise nor log-wise). As p

Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-05-20 Thread David Schanzenbach
For our login nodes (smallish, diskless VMs) we try to limit abuse from users through a layered approach as enumerated below. 1. User education: Users of our cluster are required to attend a training that is run by our group. In these sessions we do go over what we do and don't allow on the