s on behalf of Marcus Wagner
Sent: 08 November 2019 13:00
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] oom-kill events for no good reason
Hi David,
Yes, I see these messages too, and I also think they are most likely
spurious. If a job really has been cancelled by the OOM killer, you can
see this with sacct, e.g.
$> sacct -j 10816098
JobID JobName Partition Account AllocCPUS State ExitCode
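A minimal sketch of how that sacct check could be scripted. It assumes sacct's pipe-delimited output (-P) with JobID and State as the first two fields, and that steps killed by the OOM killer are reported with the state OUT_OF_MEMORY. The helper name oom_steps is hypothetical, not part of Slurm.

```shell
# Hypothetical helper: read `sacct -P` output on stdin and print the IDs of
# job steps whose State column (assumed to be field 2) is OUT_OF_MEMORY.
oom_steps() {
    # NR > 1 skips the header line that sacct prints by default.
    awk -F'|' 'NR > 1 && $2 == "OUT_OF_MEMORY" { print $1 }'
}

# Usage, with the job ID from the example above:
#   sacct -j 10816098 -P --format=JobID,State,ExitCode,MaxRSS,ReqMem | oom_steps
```

If this prints nothing for a job that logged oom-kill events, that would be consistent with the messages being spurious for that job.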
On 11/7/19 8:36 AM, David Baker wrote:
Hello,
We are dealing with a strange issue on our shared nodes where jobs appear to
stall for some reason. I was advised that this issue might be related to the
oom-killer process. We do see a lot of these events. In fact, when I started to
take a closer look this afternoon I noticed that