Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-12-07 Thread Ryan Novosielski
This is only somewhat relevant, but the scenario presents itself similarly. This is not in a scheduler environment, but we have an interactive server that would have ps hangs on certain tasks (top -bn1 is a way around that, BTW, if it’s hard to even find out what the process is). For us, it appeared t

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-12-07 Thread Christopher Benjamin Coffey
Is this parameter applied to each cgroup? Or just the system itself? Seems like just the system itself. — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 12/4/18, 10:13 AM, "slurm-users on behalf of Christopher Benjamin Coffey" wrote: Interesti

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-12-04 Thread Christopher Benjamin Coffey
Interesting! I'll have a look - thanks! — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 11/30/18, 1:41 AM, "slurm-users on behalf of John Hearns" wrote: Chris, I have delved deep into the OOM killer code and interaction with cpusets in the p

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-11-30 Thread John Hearns
Chris, I have delved deep into the OOM killer code and its interaction with cpusets in the past (*). That experience is not really relevant! However, I always recommend looking at this sysctl parameter: min_free_kbytes https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/perform

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-11-30 Thread Ole Holm Nielsen
On 29-11-2018 19:27, Christopher Benjamin Coffey wrote: We've been noticing nodes that from time to time become "wedged", or unusable. This is a state where ps and w hang. We've been looking into this for a while when we get time and finally put some more effort into it yesterday

[slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-11-29 Thread Christopher Benjamin Coffey
Hi, We've been noticing nodes that from time to time become "wedged", or unusable. This is a state where ps and w hang. We've been looking into this for a while when we get time and finally put some more effort into it yesterday. We came across this blog which describes almost th
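
For readers hitting the same symptom, here is a minimal sketch of one way to list processes in D state (uninterruptible sleep) when ps itself hangs. It is not taken from any message in this thread. It assumes that reading /proc/<pid>/stat still returns on the affected node, which is not guaranteed, and the function names parse_stat and d_state_processes are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen].strip())
    comm = stat_text[lparen + 1:rparen]
    # The first field after the closing parenthesis is the state letter.
    state = stat_text[rparen + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes whose state is 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat line was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

The sketch deliberately reads only the stat file and never touches /proc/<pid>/cmdline, so it prints the short command name rather than the full command line.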