Thanks so much, Greg! That looks like the solution we want, but like John, I'm also unfamiliar with spank plugins. I guess that will have to change.
Mark

On Wed, Apr 6, 2022 at 7:54 AM John Hanks <griz...@gmail.com> wrote:

> Thanks, Greg! This looks like the right way to do this. I will have to
> stop putting off learning to use spank plugins :)
>
> griznog
>
> On Wed, Apr 6, 2022 at 1:40 AM Greg Wickham <greg.wick...@kaust.edu.sa> wrote:
>
>> Hi John, Mark,
>>
>> We use a spank plugin, https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir
>> (derived from other authors but modified for the functionality required on site).
>>
>> It can bind tmpfs mount points into the user's cgroup allocation, and
>> additional bind options can be provided (e.g. limiting memory by size or
>> by percentage, as supported by tmpfs(5)).
>>
>> More information is in the README.md.
>>
>> -Greg
>>
>> On 05/04/2022, 23:17, "slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote:
>>
>> I've thought-experimented with this in the past, wanting to do the same
>> thing, but I haven't found any way to get a /dev/shm or a tmpfs into a
>> job's cgroup so that it is accounted against the job's allocation. The
>> best I have come up with is creating a per-job tmpfs from a prolog,
>> removing it from an epilog, and setting its size to some amount of memory
>> that at least limits how much damage the job can do. Another alternative
>> is to only allow access to a memory filesystem if the job request is
>> exclusive and takes the whole node. Crude, but effective at least to the
>> point of preventing one job from killing others. If you happen to find a
>> real solution, please post it :)
>>
>> griznog
>>
>> On Mon, Apr 4, 2022 at 10:19 AM Mark Coatsworth
>> <mark.coatswo...@vectorinstitute.ai> wrote:
>>
>> Hi all,
>>
>> We have a GPU cluster (Slurm 19.05.3) that typically runs large PyTorch
>> jobs dependent on shared memory (/dev/shm). When our machines get busy, we
>> often run into a problem where one job exhausts all the shared memory on a
>> system, causing any other jobs landing there to fail immediately.
>>
>> We're trying to figure out a good way to manage this resource. I know
>> that Slurm counts shared memory as part of a job's total memory
>> allocation, so we could use cgroups to OOM-kill jobs that exceed it. But
>> that doesn't prevent a user from simply making a large request and
>> exhausting it all anyway.
>>
>> Does anybody have any thoughts or experience with setting real limits on
>> shared memory, and either swapping it out or killing the job if the limit
>> is exceeded? One thought we had was to use a new generic resource (GRES).
>> This is pretty easy to add in the configuration, but it seems like it
>> would be a huge task to write a plugin that actually enforces it.
>>
>> Is this something where the Job Container plugin might be useful?
>>
>> Any thoughts or suggestions would be appreciated,
>>
>> Mark
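
On the cgroup point Mark raises: enforcing a job's memory allocation with the task/cgroup plugin is configured roughly as below. This is only a generic sketch of the standard knobs, not anyone's actual site configuration, and the values shown are placeholders:

    # slurm.conf (sketch)
    TaskPlugin=task/cgroup

    # cgroup.conf (sketch)
    CgroupAutomount=yes
    ConstrainRAMSpace=yes     # enforce the job's RAM allocation
    ConstrainSwapSpace=yes    # also cap swap usage
    AllowedRAMSpace=100       # percent of the allocation the cgroup may use

As Mark notes, this only caps a job at what it requested; it does not stop a large but legitimate request from exhausting /dev/shm for everyone else on the node.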
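A minimal sketch of the prolog/epilog workaround John describes is below. The /mnt/jobtmp location, the fixed 16G cap, and the mode bits are assumptions for illustration, not anything from his site:

    #!/bin/bash
    # Prolog sketch: create a per-job tmpfs with a hard size cap.
    # SLURM_JOB_ID is set in the prolog environment by slurmd.
    JOBTMP="/mnt/jobtmp/${SLURM_JOB_ID}"
    mkdir -p "$JOBTMP"
    mount -t tmpfs -o size=16G,mode=1777 tmpfs "$JOBTMP"

    #!/bin/bash
    # Epilog sketch: tear the per-job tmpfs down when the job ends.
    JOBTMP="/mnt/jobtmp/${SLURM_JOB_ID}"
    umount "$JOBTMP" && rmdir "$JOBTMP"

The size= cap is what bounds the damage; jobs still have to be pointed at the per-job directory (for example via TMPDIR) rather than /dev/shm for the limit to mean anything.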
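For anyone else unfamiliar with SPANK: plugins like Greg's are enabled through plugstack.conf. The entry below is only a sketch with a made-up install path and no plugin-specific options; the README.md in Greg's repository documents what the plugin actually accepts:

    # /etc/slurm/plugstack.conf (sketch)
    # "required" makes jobs fail if the plugin cannot be loaded;
    # use "optional" for best-effort loading instead.
    required /usr/lib64/slurm/spank/private-tmpdir.so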