Hi all!
I think I have solved the problem.
The system is an openSUSE Leap 15 installation, and Slurm comes from the
repository. By default, a slurm.epilog.clean script is installed which kills
everything that belongs to the user when a job finishes, including other
jobs, SSH sessions and so on. I do not
Hello!
I cannot find any hints of OOM kills, but it is systemd, so I may need a
little more time searching. We have 128 GB of memory on the node, and as far
as we know the tasks do not use it to the limit; dependencies have also
worked fine with the same tasks. Monitoring does not show any problems with
memory. T
Hello Uwe,
When the requested time limit of a job runs out, the job is cancelled and
terminated with signal SIGTERM (15) and later with SIGKILL (9) if that should
fail; the job then gets the state "TIMEOUT".
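
To illustrate the SIGTERM-then-SIGKILL sequence described above, here is a minimal Python sketch. It is not Slurm's code; the function name stop_with_escalation and the grace_seconds parameter are made up for this example (in Slurm itself the delay between the two signals is, as far as I know, controlled by KillWait in slurm.conf). It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def stop_with_escalation(proc, grace_seconds=1.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the number of the signal that ended the process, or None if
    the process exited on its own (non-negative return code).
    Hypothetical helper for illustration only.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process ignored or outlived SIGTERM: escalate to SIGKILL.
        proc.kill()
        proc.wait()
    rc = proc.returncode
    # On POSIX, a negative return code -N means "terminated by signal N".
    return -rc if rc < 0 else None


if __name__ == "__main__":
    # A child that does nothing special with SIGTERM.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    print("ended by signal:", stop_with_escalation(child))
```

A process that dies from the first signal reports 15, which corresponds to the normal time-limit path; a process that only ever reports 9 never had a chance to react to SIGTERM.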
However, job 161 gets killed immediately by SIGKILL and gets the state
"FAILED". That sugges