Damn,
I almost always forget that most of the submission work is done on the
master :/
Best
Marcus
On 10/8/19 11:45 AM, Eddy Swan wrote:
Hi Sean,
Thank you so much for your additional information.
The issue is indeed due to missing user on the head node.
After I configured the LDAP client on s
On 10/8/19 1:47 AM, Kota Tsuyuzaki wrote:
> GPU is running as well as gres gpu:1. In addition, the NVIDIA docs seem to
> describe what I hit
> (https://docs.nvidia.com/deploy/mps/index.html#topic_4_3). It seems that
> an mps-server will be created for each user and the
> server will be running
Marcus Boden writes:
> you're looking for KillOnBadExit in the slurm.conf:
> KillOnBadExit
[...]
> this should terminate the job if a step or a process gets oom-killed.
That is a good tip!
But as I read the documentation (I haven't tested it), it will only kill
the job step itself, it will no
Juergen Salk writes:
> that is interesting. We have a very similar setup as well. However, in
> our Slurm test cluster I have noticed that it is not the *job* that
> gets killed. Instead, the OOM killer terminates one (or more)
> *processes*
Yes, that is how the kernel OOM killer works.
This is
Hello, thanks for your answers,
> - Does it work if you remove the space in "TaskPlugin=task/affinity,
> task/cgroup"? (Slurm can be quite picky when reading slurm.conf).
That was a mistake I made when copying and pasting. There is no space in my actual configuration.
>
> - See in slurmd.log on the node(s) of the
> On 19-10-08 10:36, Juergen Salk wrote:
> > * Bjørn-Helge Mevik [191008 08:34]:
> > > Jean-mathieu CHANTREIN writes:
> > >
> > > > I tried using, in slurm.conf
> > > > TaskPlugin=task/affinity, task/cgroup
> > > > SelectTypeParameters=CR_CPU_Memory
> > > > MemLimitEnforce=yes
> > > >
> > >
Hi Sean,
Thank you so much for your additional information.
The issue is indeed due to missing user on the head node.
After I configured the LDAP client on slurm-master, the srun command now works
with an LDAP account.
Best regards,
Eddy Swan
On Tue, Oct 8, 2019 at 4:15 PM Sean Crosby wrote:
> Look
Hi Jürgen,
you're looking for KillOnBadExit in the slurm.conf:
KillOnBadExit
If set to 1, a step will be terminated immediately if any task is crashed
or aborted, as indicated by a non-zero exit code. With the default value of 0,
if one of the processes is crashed or aborted the other proces
* Bjørn-Helge Mevik [191008 08:34]:
> Jean-mathieu CHANTREIN writes:
>
> > I tried using, in slurm.conf
> > TaskPlugin=task/affinity, task/cgroup
> > SelectTypeParameters=CR_CPU_Memory
> > MemLimitEnforce=yes
> >
> > and in cgroup.conf:
> > CgroupAutomount=yes
> > ConstrainCores=yes
> > C
Looking at the SLURM code, it looks like it is failing with a call to
getpwuid_r on the ctld.
What is the output of the following (on slurm-master):
getent passwd turing
getent passwd 1000
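
If it is easier to script, the same check can be sketched in Python. This is only an illustration, not part of Slurm: it assumes Python 3 is available on slurm-master, and the helper name lookup_passwd is made up for this example. The standard pwd module queries the same passwd database as getent, so a user that is missing here should also be missing for slurmctld.

```python
import pwd


def lookup_passwd(user_or_uid):
    """Return the passwd entry for a user name or numeric UID, or None if absent.

    Hypothetical helper for illustration; it only wraps the standard pwd module.
    """
    try:
        if isinstance(user_or_uid, int):
            return pwd.getpwuid(user_or_uid)
        return pwd.getpwnam(user_or_uid)
    except KeyError:
        # pwd raises KeyError when no matching entry exists
        return None


if __name__ == "__main__":
    # "turing" and 1000 are the user name and UID asked about above
    for key in ("turing", 1000):
        entry = lookup_passwd(key)
        print(key, "->", entry if entry is not None else "not found")
```

A "not found" for either lookup on slurm-master would point to the user being unknown on that host.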
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Platform Services | Business Services
CoEPP Research Compu