Re: [slurm-users] oom-kill events for no good reason

2019-11-12 Thread David Baker
s on behalf of Marcus Wagner
Sent: 08 November 2019 13:00
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] oom-kill events for no good reason
Hi David, yes, I see these messages too. I also think this is more likely an erroneous message. If a job has been cancelled by the OOM-Kill

Re: [slurm-users] oom-kill events for no good reason

2019-11-08 Thread Marcus Wagner
Hi David, yes, I see these messages too. I also think this is more likely an erroneous message. If a job has been cancelled by the OOM-Killer, you can see this with sacct, e.g.
$> sacct -j 10816098
   JobID    JobName  Partition    Account  AllocCPUS  State ExitCode
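The sacct check Marcus describes could be scripted roughly as follows. This is only a sketch under stated assumptions: it expects text produced by `sacct -j <jobid> --format=JobID,State,ExitCode --parsable2 --noheader`, it assumes that steps killed for exceeding memory are reported with the state OUT_OF_MEMORY, and the helper name `oom_steps` and the sample rows are hypothetical, not taken from the thread.

```python
# Sketch: list job steps that sacct reports as killed for exceeding memory.
# Assumed input: output of
#   sacct -j <jobid> --format=JobID,State,ExitCode --parsable2 --noheader
# i.e. one step per line, fields separated by "|".

def oom_steps(sacct_text):
    """Return the JobIDs of steps whose State is OUT_OF_MEMORY (assumed state name)."""
    hits = []
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        job_id, state, _exit_code = fields[:3]
        if state.startswith("OUT_OF_MEMORY"):
            hits.append(job_id)
    return hits


# Hypothetical sample rows for illustration only; not output from the thread.
sample = """10816098|COMPLETED|0:0
10816098.batch|OUT_OF_MEMORY|0:125
10816098.extern|COMPLETED|0:0"""

print(oom_steps(sample))
```

An empty result for a job that logged oom-kill events would be consistent with the suspicion in this thread that the logged message is misleading, though it would not prove it.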

Re: [slurm-users] oom-kill events for no good reason

2019-11-07 Thread Christopher Samuel
On 11/7/19 8:36 AM, David Baker wrote:
We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact when I started to take a closer lo

[slurm-users] oom-kill events for no good reason

2019-11-07 Thread David Baker
Hello, We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact when I started to take a closer look this afternoon I noticed that