I seem to have run into an edge case where I’m able to oversubscribe a specific
subset of GPUs on one host in particular.
Slurm 22.05.8
Ubuntu 20.04
cgroups v1 (ProctrackType=proctrack/cgroup)
It seems to be a corner case that only shows up under a couple of specific conditions.
This host has 2 different GPU types in th
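A quick way to see whether the devices really are shared (a sketch; the node
name is a placeholder, and I’m assuming cgroup device containment is what
should be isolating the GPUs here):

    # Launch two single-GPU steps on the affected node and compare what each sees.
    srun -w <node> --gres=gpu:1 bash -c 'echo "$CUDA_VISIBLE_DEVICES"; nvidia-smi -L' &
    srun -w <node> --gres=gpu:1 bash -c 'echo "$CUDA_VISIBLE_DEVICES"; nvidia-smi -L' &
    wait
    # If both steps report the same GPU UUID, that device is oversubscribed.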
Hi!
I manage a small CentOS 8 cluster running Slurm 20.11.7-1 and
OpenMPI built from source.
- I know this OS is not maintained any more and I need to negotiate
downtime to reinstall
- I know Slurm 20.11.7 has known security issues (I built it from source
some years ago with rpmbuild -ta --w
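For reference, the documented way to build the RPMs straight from the release
tarball looks like this (just the base command; the flags after -ta are cut
off above, so I won’t guess at them):

    # Build Slurm RPMs directly from the release tarball.
    rpmbuild -ta slurm-20.11.7.tar.bz2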
The Prolog will run with every job, not just "as asked for" by the user.
It also runs as the root or slurm user, not as the user who submitted the job.
For per-user execution one would use TaskProlog, but at that point there is,
I think, no way to abort or requeue the job from TaskProlog.
The Prolog script could check for e
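As a minimal sketch of that idea (the readiness check and its path are
hypothetical, not a real Slurm helper; per the slurm.conf docs, a non-zero
exit from the Prolog fails the launch, drains the node, and requeues a batch
job):

    #!/bin/bash
    # Prolog: runs as root/SlurmUser on the compute node before the job starts.
    # SLURM_JOB_ID is set in the Prolog environment.
    if ! /usr/local/sbin/node_ready_check; then   # hypothetical readiness check
        echo "node not ready, refusing job $SLURM_JOB_ID" >&2
        exit 1   # non-zero exit: launch fails, node is drained, job requeued
    fi
    exit 0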
You probably want the Prolog option:
https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with:
https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail
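Something like this in slurm.conf, if I’m reading those pages right
(ForceRequeueOnFail being a PrologFlags value is my assumption from the
second link; the script path is a placeholder):

    # slurm.conf
    Prolog=/etc/slurm/prolog.sh
    PrologFlags=ForceRequeueOnFail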
-Paul Edmon-
On 2/14/2024 8:38 AM, Cutts, Tim via slurm-users wrote:
Hi, I apologise if I’ve failed to find this in the documentation (and am happy
to be told to RTFM), but a recent issue for one of my users resulted in a
question I couldn’t answer.
LSF has a feature called Pre-Exec, where a script runs to check whether a
node is ready to run a task. So, yo
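For comparison, the LSF side is roughly this (a sketch of bsub’s -E option;
the script path and job script are placeholders):

    # LSF: -E runs a pre-exec command on the execution host before the job;
    # a non-zero exit keeps the job from running there and it is re-dispatched.
    bsub -E "/path/to/pre_exec_check.sh" ./my_job.sh

The closest Slurm analogue seems to be the Prolog discussed above.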
Hi,
Having used
https://github.com/giovtorres/slurm-docker-cluster
successfully a couple of years ago to develop a job_submit.lua plugin,
I am trying to do this again.
However, the plugin that works on our current cluster (CentOS 7.9,
Slurm 23.02.7) fails in the Docker cluster (Rocky 8.9, S
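For anyone trying to reproduce the setup, a bare-bones job_submit.lua (just
the skeleton from the Slurm docs, not the plugin in question) looks like:

    -- job_submit.lua: slurmctld calls these hooks on every submit/modify.
    function slurm_job_submit(job_desc, part_list, submit_uid)
        slurm.log_info("slurm_job_submit: submission from uid %u", submit_uid)
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end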