On 5/19/21 1:41 pm, Tim Carlson wrote:
but I still don't understand how with "shared=exclusive" srun gives one
result and sbatch gives another.
I can't either, but I can reproduce it with Slurm 20.11.7. :-/
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 5/19/21 9:15 pm, Herc Silverstein wrote:
Does anyone have an idea of what might be going on?
To add to the other suggestions, I would say that checking the slurmctld
and slurmd logs to see what they report as being wrong is a good place to start.
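For anyone unsure where those logs live, a quick way to find them (paths vary by site; the locations below are common defaults, not guaranteed):
```
# Ask the running config where the daemons write their logs:
scontrol show config | grep -iE 'SlurmctldLogFile|SlurmdLogFile'

# Then, on the controller and the affected compute node respectively:
tail -f /var/log/slurmctld.log    # controller (path is an assumption)
tail -f /var/log/slurmd.log       # compute node (path is an assumption)
```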
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
The SLURM controller AND all the compute nodes need to know which nodes are in
the cluster. If you want to add a node, or a node changes IP addresses, you need
to let all the nodes know about this, which, for me, usually means
restarting slurmd on the compute nodes.
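As a sketch of what that usually looks like (exact steps depend on your Slurm version; older releases need a full slurmctld restart when a brand-new node is added, not just a reconfigure):
```
# after copying the updated slurm.conf to the controller and every node:
scontrol reconfigure              # have slurmctld re-read slurm.conf
# on each compute node (e.g. via ssh/pdsh):
systemctl restart slurmd
# verify the node registered with the controller:
scontrol show node <nodename>
sinfo -N -l
```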
I just say this because I get caught by th
Does it tell you the reason for it being down?
sinfo -R
I have seen cases where a node comes up, but the amount of memory slurmd sees
is a little less than what was configured in slurm.conf.
You should always set aside some of the memory when defining it in
slurm.conf so you have memory left for the operating system.
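A hedged illustration of that idea (node names and numbers are made up): compare what slurmd detects with what slurm.conf declares, and declare a bit less.
```
# See what slurmd actually detects on the node:
slurmd -C
# e.g. it might report RealMemory=128836

# In slurm.conf, declare less than the detected amount so the OS keeps headroom:
NodeName=node[01-04] CPUs=32 RealMemory=126000
# Optionally reserve memory for system use explicitly (per-node parameter):
#   MemSpecLimit=2048
```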
I know, but see the script: we only do this for uid > 1000.
On 20/05/2021 17:29, Timo Rothenpieler wrote:
You shouldn't need this script and pam_exec.
You can set those limits directly in the systemd config to match every
user.
On 20.05.2021 16:28, Bas van der Vlies wrote:
same here we use the systemd user slice in our pam stack:
You shouldn't need this script and pam_exec.
You can set those limits directly in the systemd config to match every user.
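One way to do that (an illustrative drop-in, not anyone's actual config; it assumes a recent enough systemd that drop-ins under user-.slice.d apply to every user-UID.slice):
```
# /etc/systemd/system/user-.slice.d/99-login-limits.conf
[Slice]
CPUQuota=400%       # at most ~4 cores' worth of CPU per user
MemoryHigh=14G      # start throttling here
MemoryMax=16G       # hard cap for the whole user slice
TasksMax=2048
```
After a `systemctl daemon-reload`, new login sessions pick the limits up; the values are obviously examples to adjust per site.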
On 20.05.2021 16:28, Bas van der Vlies wrote:
same here we use the systemd user slice in our pam stack:
```
# Setup for local and ldap logins
session required pam_systemd.so
```
Hi Community,
just wanted to share that this problem got solved with the help of the pyxis
developers:
https://github.com/NVIDIA/pyxis/issues/47
The solution was to add
ConstrainDevices=yes
which was missing from the cgroup.conf file.
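For reference, the file ends up looking roughly like this (only ConstrainDevices=yes is the fix mentioned above; the other lines are typical settings and just assumptions):
```
# /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes    # the line that was missing
```
slurmd on the compute nodes needs a restart to pick up a cgroup.conf change.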
On Thu, May 13, 2021 at 5:14 PM Cristóbal Navarro <
cristobal.navarr.
same here we use the systemd user slice in our pam stack:
```
# Setup for local and ldap logins
session required pam_systemd.so
session required pam_exec.so seteuid type=open_session /etc/security/limits.sh
```
limits.sh:
```
#!/bin/sh -e
PAM_UID=$(getent passwd "${PAM_USER}" | cut -d: -f3)
```
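The script is cut off in the archive; purely as a hypothetical continuation (not the original), a pam_exec script like this could skip system accounts and then cap the user's slice, in line with the uid > 1000 remark earlier in the thread:
```
# hypothetical continuation of limits.sh (values are examples, not the real script)
if [ "$PAM_UID" -lt 1000 ]; then
    exit 0    # leave system/service accounts alone
fi
systemctl set-property "user-${PAM_UID}.slice" MemoryMax=16G CPUQuota=400%
```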
Hi;
We use a bash script to watch and kill users' processes if they exceed
our cpu and memory limits. This solution also ensures that total cpu or
memory usage cannot be exceeded, whether by many well-behaved users or
by a single bad user:
https://github.com/mercanca/kill_for_loginnode.sh
A
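The linked repository has the real script; a minimal sketch of the general idea (thresholds and details are made up) might look like:
```
#!/bin/sh
# Kill regular users' processes that exceed per-process CPU or memory thresholds.
CPU_LIMIT=50    # percent of one core
MEM_LIMIT=10    # percent of total RAM
ps -eo uid,pid,user:20,pcpu,pmem --no-headers | while read -r uid pid user pcpu pmem; do
    [ "$uid" -lt 1000 ] && continue    # skip system accounts
    over=$(awk -v c="$pcpu" -v m="$pmem" -v cl="$CPU_LIMIT" -v ml="$MEM_LIMIT" \
               'BEGIN { print ((c > cl || m > ml) ? 1 : 0) }')
    if [ "$over" -eq 1 ]; then
        logger "login-node watchdog: killing pid $pid ($user) cpu=$pcpu mem=$pmem"
        kill -9 "$pid"
    fi
done
```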
On 24.04.2021 04:37, Cristóbal Navarro wrote:
Hi Community,
I have a set of users still not so familiar with slurm, and yesterday
they bypassed srun/sbatch and just ran their CPU program directly on the
head/login node thinking it would still run on the compute node. I am
aware that I will nee
We had a situation recently where a desktop was turned off for a week. When
we brought it back online (in a different part of the network with a different
IP), everything came up fine (slurmd and munge).
But it kept going into DOWN* for no apparent reason (neither daemon-wise nor
log-wise).
As p
For our login nodes (smallish, diskless VMs) we try to limit abuse from
users through a layered approach, as enumerated below.
1. User education
Users of our cluster are required to attend a training that is run by
our group. In these sessions we do go over what we do and don't allow
on the