Re: [slurm-users] Running job is canceled when starting a new job from queue

2019-10-29 Thread Uwe Seher

Hi all! I think I solved the problem. The system is an openSUSE Leap 15 installation and Slurm comes from the repository. By default a slurm.epilog.clean script is installed, which kills everything that belongs to the user when a job finishes, including other jobs, SSH sessions and so on. I do not
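To illustrate the failure mode described above, here is a minimal sketch of the check a cleanup step would need before killing a user's processes. It is not the actual slurm.epilog.clean script; the function name `safe_to_kill_user_processes` and the `(job_id, user)` tuple format are assumptions made only for this example.

```python
def safe_to_kill_user_processes(ending_job_id, user, jobs_on_node):
    """Return True only if the user has no other job left on this node.

    jobs_on_node is a hypothetical list of (job_id, user) tuples for the
    jobs currently on the node, including the one that is ending.
    """
    for job_id, job_user in jobs_on_node:
        # Any other job of the same user means a blanket kill would hit it.
        if job_user == user and job_id != ending_job_id:
            return False
    return True


if __name__ == "__main__":
    jobs = [(160, "uwe"), (161, "uwe"), (162, "alice")]
    # Job 160 ends while job 161 of the same user is still running.
    print(safe_to_kill_user_processes(160, "uwe", jobs))
    # Job 162 is the only job of its user on the node.
    print(safe_to_kill_user_processes(162, "alice", jobs))
```

A cleanup step that skips this kind of check and kills by user would take down the second job as well, which matches the behaviour reported in this thread.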

Re: [slurm-users] Running job is canceled when starting a new job from queue

2019-10-28 Thread Uwe Seher

Hello! I cannot find any hints of OOM kills, but it is systemd, so I may need a little more time searching. We have 128 GB of memory on the node and, as far as we know, the tasks do not use it to the limit; dependencies have also worked fine with the same tasks. Monitoring does not show any problems with memory. T

Re: [slurm-users] Running job is canceled when starting a new job from queue

2019-10-28 Thread Lech Nieroda

Hello Uwe, when the requested time limit of a job runs out, the job is cancelled: it is terminated with signal SIGTERM (15), followed later by SIGKILL (9) if that fails, and it gets the state "TIMEOUT". However, job 161 is killed immediately by SIGKILL and gets the state "FAILED". That sugges
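The SIGTERM-then-SIGKILL escalation described in this message can be reproduced outside Slurm. The following is a minimal POSIX-only sketch using Python's standard library, not Slurm's own implementation; the helper names `spawn_child` and `terminate_with_grace` and the grace period are assumptions chosen for illustration.

```python
import signal
import subprocess
import sys


def spawn_child(ignore_sigterm):
    """Start a sleeping child process; optionally make it ignore SIGTERM."""
    setup = "signal.signal(signal.SIGTERM, signal.SIG_IGN); " if ignore_sigterm else ""
    code = "import signal, time; " + setup + "print('ready', flush=True); time.sleep(60)"
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has finished its setup
    return proc


def terminate_with_grace(proc, grace_seconds=1.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the number of the signal that ended the process.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()
    proc.stdout.close()
    # On POSIX, a returncode of -N means the child was killed by signal N.
    return -proc.returncode


if __name__ == "__main__":
    print(terminate_with_grace(spawn_child(ignore_sigterm=False)))
    print(terminate_with_grace(spawn_child(ignore_sigterm=True)))
```

A process that ends with signal 9 without a preceding SIGTERM, as reported for job 161, did not go through this graceful path.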