Thanks, Greg! This looks like the right way to do this. I will have to stop putting off learning to use spank plugins :)
griznog

On Wed, Apr 6, 2022 at 1:40 AM Greg Wickham <greg.wick...@kaust.edu.sa> wrote:

> Hi John, Mark,
>
> We use a spank plugin, https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir (this was derived from other authors but modified for functionality required on site).
>
> It can bind tmpfs mount points to the user's cgroup allocation; additionally, bind options can be provided (e.g. limit memory by size, or limit memory by % as supported by tmpfs(5)).
>
> More information is in the README.md.
>
> -Greg
>
> On 05/04/2022, 23:17, "slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote:
>
> I've thought-experimented with this in the past, wanting to do the same thing, but I haven't found any way to get /dev/shm or a tmpfs into a job's cgroups so that it is accounted against the job's allocation. The best I have come up with is creating a per-job tmpfs from a prolog, removing it in the epilog, and setting its size to some amount of memory that at least puts a restriction on how much damage the job can do. Another alternative is to only allow access to a memory filesystem if the job request is exclusive and takes the whole node. Crude, but effective, at least to the point of preventing one job from killing others. If you happen to find a real solution, please post it :)
>
> griznog
>
> On Mon, Apr 4, 2022 at 10:19 AM Mark Coatsworth <mark.coatswo...@vectorinstitute.ai> wrote:
>
> Hi all,
>
> We have a GPU cluster (Slurm 19.05.3) that typically runs large PyTorch jobs dependent on shared memory (/dev/shm). When our machines get busy, we often run into a problem where one job exhausts all the shared memory on a system, causing any other jobs that land there to fail immediately.
>
> We're trying to figure out a good way to manage this resource. I know that Slurm counts shared memory as part of a job's total memory allocation, so we could use cgroups to OOM-kill jobs that exceed this. But that doesn't prevent a user from just making a large request and exhausting it all anyway.
>
> Does anybody have any thoughts or experience with setting real limits on shared memory, and either swapping it out or killing the job if the limit gets exceeded? One thought we had was to use a new generic resource (GRES). This is pretty easy to add in the configuration, but it seems like it would be a huge task to write a plugin that actually enforces it.
>
> Is this something where the Job Container plugin might be useful?
>
> Any thoughts or suggestions would be appreciated,
>
> Mark
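
For reference, a minimal sketch of the per-job tmpfs prolog/epilog approach described above. It assumes root-run Prolog/Epilog scripts and a hypothetical /mnt/job_tmp base directory; the 16g cap is only an example of the tmpfs(5) "size=" option (a percentage such as size=25% also works), and none of this is taken from the spank plugin linked above.

    #!/bin/bash
    # prolog.sh (sketch): create a size-capped per-job tmpfs
    JOBDIR="/mnt/job_tmp/${SLURM_JOB_ID}"   # hypothetical base path
    mkdir -p "${JOBDIR}"
    # Cap the tmpfs so one job cannot exhaust all of the node's memory.
    mount -t tmpfs -o size=16g,mode=1777 tmpfs "${JOBDIR}"

    #!/bin/bash
    # epilog.sh (sketch): tear the per-job tmpfs down when the job ends
    JOBDIR="/mnt/job_tmp/${SLURM_JOB_ID}"
    umount "${JOBDIR}" && rmdir "${JOBDIR}"

Jobs would then be pointed at the per-job directory, for example by having a TaskProlog print an export line for TMPDIR.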