[slurm-users] GPU shards not exclusive

2024-02-14 Thread Reed Dier via slurm-users
I seem to have run into an edge case where I’m able to oversubscribe a specific subset of GPUs on one host in particular. Environment: Slurm 22.05.8, Ubuntu 20.04, cgroups v1 (ProctrackType=proctrack/cgroup). It seems to be partly a corner case with a couple of caveats. This host has 2 different GPU types in th

[slurm-users] Problem building slurm with PMIx

2024-02-14 Thread Patrick Begou via slurm-users
Hi! I manage a small CentOS 8 cluster using Slurm (slurm-20.11.7-1) and OpenMPI built from sources. - I know this OS is not maintained any more and I need to negotiate downtime to reinstall - I know Slurm 20.11.7 has security issues (I built it from source some years ago with rpmbuild -ta --w

[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Paul Raines via slurm-users
The Prolog will run with every job, not just "as asked for" by the user. Also, it runs as the root or slurm user, not as the user who submitted the job. For that one would use TaskProlog, but at that point I think there is no way to abort or requeue the job from TaskProlog. The Prolog script could check for e

[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Paul Edmon via slurm-users
You probably want the Prolog option: https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with: https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail -Paul Edmon- On 2/14/2024 8:38 AM, Cutts, Tim via slurm-users wrote: Hi, I apologise if I’ve failed to find this in the docum

[slurm-users] Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Cutts, Tim via slurm-users
Hi, I apologise if I’ve failed to find this in the documentation (and am happy to be told to RTFM) but a recent issue for one of my users resulted in a question I couldn’t answer. LSF has a feature called a Pre-Exec where a script executes to check whether a node is ready to run a task. So, yo

[slurm-users] job_submit.lua - uid in Docker cluster

2024-02-14 Thread Loris Bennett via slurm-users
Hi, Having used https://github.com/giovtorres/slurm-docker-cluster successfully a couple of years ago to develop a job_submit.lua plugin, I am trying to do this again. However, the plugin which works on our current cluster (CentOS 7.9, Slurm 23.02.7) fails in the Docker cluster (Rocky 8.9, S