Re: [slurm-users] srun: Error generating job credential

2019-10-08 Thread Marcus Wagner
Damn, I almost always forget that most of the submission part is done on the master :/ Best Marcus On 10/8/19 11:45 AM, Eddy Swan wrote: Hi Sean, Thank you so much for your additional information. The issue is indeed due to a missing user on the head node. After I configured the LDAP client on s

Re: [slurm-users] How to share GPU resources? (MPS or another way?)

2019-10-08 Thread Goetz, Patrick G
On 10/8/19 1:47 AM, Kota Tsuyuzaki wrote:
> GPU is running as well as gres gpu:1. Moreover, the NVIDIA docs appear to describe what I hit (https://docs.nvidia.com/deploy/mps/index.html#topic_4_3). It seems that an MPS server will be created for each user and the server will be running

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
Marcus Boden writes:
> you're looking for KillOnBadExit in the slurm.conf:
> KillOnBadExit [...]
> this should terminate the job if a step or a process gets oom-killed.
That is a good tip! But as I read the documentation (I haven't tested it), it will only kill the job step itself; it will no

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
Juergen Salk writes:
> that is interesting. We have a very similar setup as well. However, in our Slurm test cluster I have noticed that it is not the *job* that gets killed. Instead, the OOM killer terminates one (or more) *processes*
Yes, that is how the kernel OOM killer works. This is

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Jean-mathieu CHANTREIN
Hello, thanks for your answers.
> - Does it work if you remove the space in "TaskPlugin=task/affinity, task/cgroup"? (Slurm can be quite picky when reading slurm.conf.)
That was already the case: the space came from a mistake when I copied and pasted... So I don't have a space there.
>
> - See in slurmd.log on the node(s) of the
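Since the first suggestion in this exchange is to rule out whitespace after the comma in a plugin list, a quick local check can flag such values before restarting the daemons. The following is a minimal Python sketch under stated assumptions: it only understands simple one-setting-per-line `Key=Value` syntax, strips `#` comments, and the helper name `find_spaced_lists` is hypothetical, not part of Slurm. It only flags the pattern; whether a given Slurm version actually rejects it is what the reply was testing.

```python
import re

# Hypothetical helper (not Slurm's own parser): flag "Key=Value" lines whose
# value has whitespace next to a comma, e.g. "task/affinity, task/cgroup".
def find_spaced_lists(conf_text):
    problems = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and outer whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Whitespace directly before or after a comma inside the value
        if re.search(r"\s,|,\s", value):
            problems.append((lineno, key.strip(), value))
    return problems


if __name__ == "__main__":
    sample = "TaskPlugin=task/affinity, task/cgroup\n"
    for lineno, key, value in find_spaced_lists(sample):
        print(f"line {lineno}: {key} has whitespace around a comma: {value!r}")
```

Running it over a copy of slurm.conf and fixing any reported line is cheaper than restarting slurmd and reading its log.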

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Juergen Salk
> On 19-10-08 10:36, Juergen Salk wrote:
> > * Bjørn-Helge Mevik [191008 08:34]:
> > > Jean-mathieu CHANTREIN writes:
> > >
> > > > I tried using, in slurm.conf
> > > > TaskPlugin=task/affinity, task/cgroup
> > > > SelectTypeParameters=CR_CPU_Memory
> > > > MemLimitEnforce=yes

Re: [slurm-users] srun: Error generating job credential

2019-10-08 Thread Eddy Swan
Hi Sean, Thank you so much for your additional information. The issue is indeed due to a missing user on the head node. After I configured the LDAP client on slurm-master, the srun command now works with the LDAP account. Best regards, Eddy Swan On Tue, Oct 8, 2019 at 4:15 PM Sean Crosby wrote: > Look

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Marcus Boden
Hi Jürgen, you're looking for KillOnBadExit in slurm.conf:
KillOnBadExit: If set to 1, a step will be terminated immediately if any task is crashed or aborted, as indicated by a non-zero exit code. With the default value of 0, if one of the processes is crashed or aborted the other proces
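The quoted man-page text describes KillOnBadExit at the level of a single job step. As a toy model only (this is not Slurm code), the rule can be sketched in Python: with KillOnBadExit=1, once one task exits with a non-zero code, every task of that step that has not yet finished is terminated; with the default 0, the other tasks keep running. The function and field names are hypothetical, and the model assumes tasks finish in list order. It deliberately says nothing about the rest of the job, which is exactly the step-versus-job question raised in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepResult:
    finished: List[int]        # indices of tasks that ran to completion
    killed: List[int]          # indices of tasks terminated with the step
    first_bad: Optional[int]   # index of the first task with a non-zero exit code


def simulate_step(exit_codes: List[int], kill_on_bad_exit: int = 0) -> StepResult:
    """Toy model of KillOnBadExit for one job step (hypothetical, not Slurm code).

    exit_codes[i] is the code task i would return; tasks are assumed to
    finish in list order.
    """
    finished: List[int] = []
    killed: List[int] = []
    first_bad: Optional[int] = None
    for i, code in enumerate(exit_codes):
        if first_bad is not None and kill_on_bad_exit == 1:
            killed.append(i)  # step already torn down: remaining tasks are killed
            continue
        finished.append(i)
        if code != 0 and first_bad is None:
            first_bad = i
    return StepResult(finished, killed, first_bad)
```

With `simulate_step([0, 1, 0, 0], kill_on_bad_exit=1)` tasks 2 and 3 end up in `killed`; with the default of 0 all four tasks finish.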

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Juergen Salk
* Bjørn-Helge Mevik [191008 08:34]:
> Jean-mathieu CHANTREIN writes:
>
> > I tried using, in slurm.conf
> > TaskPlugin=task/affinity, task/cgroup
> > SelectTypeParameters=CR_CPU_Memory
> > MemLimitEnforce=yes
> >
> > and in cgroup.conf:
> > CgroupAutomount=yes
> > ConstrainCores=yes
> > C

Re: [slurm-users] srun: Error generating job credential

2019-10-08 Thread Sean Crosby
Looking at the SLURM code, it looks like it is failing on a call to getpwuid_r on the ctld. What do the following return (on slurm-master)?
getent passwd turing
getent passwd 1000
Sean
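Sean's suggestion boils down to checking whether the controller host can resolve the user both by name and by UID, which is what slurmctld's getpwuid_r call needs. A minimal Python sketch of the same two lookups, assuming it is run on slurm-master: the standard-library `pwd` module queries the system password database through the C library, so users served by LDAP via NSS should appear just as they would for `getent passwd`. The user name `turing` and UID 1000 come from the thread; the helper name `check_user` is hypothetical.

```python
import pwd


def check_user(name, uid):
    """Return a list of problems; an empty list means both lookups succeed and agree."""
    problems = []
    by_name = None
    try:
        by_name = pwd.getpwnam(name)   # comparable to: getent passwd <name>
    except KeyError:
        problems.append(f"no passwd entry for user name {name!r}")
    try:
        pwd.getpwuid(uid)              # comparable to: getent passwd <uid>
    except KeyError:
        problems.append(f"no passwd entry for UID {uid}")
    if by_name is not None and by_name.pw_uid != uid:
        problems.append(f"{name!r} resolves to UID {by_name.pw_uid}, not {uid}")
    return problems


if __name__ == "__main__":
    # Values from the thread; run this on the controller (slurm-master).
    for msg in check_user("turing", 1000):
        print(msg)
```

An empty result on the compute nodes but errors on slurm-master would match the situation in this thread, where the head node was missing the LDAP client.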