Instead of using pam_slurm_adopt, your users can get a shell on the node
of a specific job, inside that job's "mapped" space, by running:
srun --pty --overlap --jobid JOBIDNUM bash
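For example, a user could look up their job ID with squeue and then attach (the job ID below is hypothetical):

  squeue --me -o '%A %j %N'            # list my jobs and the nodes they run on
  srun --pty --overlap --jobid 123456 bash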
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Mon, 14 Apr 2025 4:30am, Massimo Sgaravatto via slurm-users wrote:
What I have done is set up partition QOSes for nodes with 4 GPUs and 64
cores:

sacctmgr add qos lcncpu-part
sacctmgr modify qos lcncpu-part set priority=20 \
    flags=DenyOnLimit MaxTRESPerNode=cpu=32,gres/gpu=0

sacctmgr add qos lcngpu-part
sacctmgr modify qos lcngpu-part set priority=20 \
    flag
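A QOS like that is typically attached to the partition via the QOS= parameter on the partition line in slurm.conf; a minimal sketch, with made-up partition and node names:

  # slurm.conf (sketch; partition/node names are hypothetical)
  PartitionName=lcncpu Nodes=lcn[01-16] QOS=lcncpu-part
  PartitionName=lcngpu Nodes=lcn[01-16] QOS=lcngpu-part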
We have an 8-GPU server in which one GPU has gone into an error state that
will require a reboot to clear. I have jobs on the server running on good
GPUs that will take another 3 days to complete. In the meantime, I would
like short jobs to run on the good free GPUs until I reboot.
I set a r
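One way to get that effect (a general sketch, not necessarily what was done here) is a maintenance reservation starting at the planned reboot time, so the scheduler only backfills jobs that can finish before it. The node name, timing, and reservation name below are assumptions:

  # sketch: block the node starting in ~3 days for the reboot window
  scontrol create reservation ReservationName=gpu_reboot \
      StartTime=now+3days Duration=04:00:00 \
      Nodes=gpunode01 Users=root Flags=MAINT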
I have a cron job that emails me when hosts go into drain mode and
tells me the reason (scontrol show node=$host | grep -i reason)
We get drains with the "Kill task failed" reason probably about 5 times a
day. This despite having UnkillableStepTimeout=180
Right now we are still handling the
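A minimal sketch of a drain-report check like the one described above (the mail address and exact sinfo format are assumptions, not the poster's actual script):

  #!/bin/bash
  # sketch: mail a report of drained nodes and their reasons
  report=$(sinfo -R -h -o '%20n %E')
  if [ -n "$report" ]; then
      echo "$report" | mail -s "Drained nodes" admin@example.org
  fi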
We do the same thing. Our prolog has
==
# setup DCGMI job stats
if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then
  if [ -d /var/slurm/gpu_stats.run ] ; then
    if pgrep -f nv-hostengine >/dev/null 2>&1 ; then
      groupstr=$(/usr/bin/dcgmi group -c J$SLURM_JOB_ID -a $CUDA_VISIBLE_DEVICES)
      grou
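The excerpt is cut off here. The usual continuation of this DCGM job-stats pattern (a sketch, not the poster's actual prolog) parses the group ID out of that output, then enables and starts stats recording:

      # sketch of the typical next steps (hypothetical, not the original script)
      groupid=$(echo "$groupstr" | grep -oP 'group ID of \K[0-9]+')
      if [ -n "$groupid" ]; then
        /usr/bin/dcgmi stats -g "$groupid" -e                   # enable stats watches
        /usr/bin/dcgmi stats -g "$groupid" -s "J$SLURM_JOB_ID"  # start recording for this job
      fi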
I am using Slurm 23.11.3 and AllowAccounts works for me. We
have a partition defined with AllowAccounts, and if one tries to
submit under an account not in the list one will get:
srun: error: Unable to allocate resources: Invalid account or
account/partition combination specified
Do you have
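For reference, that kind of restriction is just the AllowAccounts parameter on the partition definition; a minimal sketch with made-up partition, node, and account names:

  # slurm.conf (sketch; names are hypothetical)
  PartitionName=restricted Nodes=node[01-08] AllowAccounts=projA,projB State=UP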
I am pretty sure there is no way to do exactly a per-user per-node limit
in SLURM. I cannot think of a good reason why one would do this. Can
you explain?
I don't see why it matters if you have two users submitting two 200G jobs
if the jobs for the users are spread out over two nodes rather t
I have never used Slurm where I have not added users explicitly first, so I
am not sure what happens in that case. But from your sshare output it
certainly seems it defaults to fairshare=parent.
Try modifying the users with
sacctmgr modify user $username set fairshare=200
and then run sshare -a -A
This depends on how you have assigned fairshare in sacctmgr when creating
the accounts and users. At our site we want fairshare only on accounts
and not users, just like you are seeing, so we create accounts with
sacctmgr -i add account $acct Description="$descr" \
fairshare=200 GrpJ
--
____
From: Paul Raines via slurm-users
Sent: Tuesday, July 9, 2024 9:24 AM
To: slurm-users
Subject: [slurm-users] Job submitted to multiple partitions not running when
any partition is full
I have a job 465072 submitted to multiple partitions (rtx6000,rtx8000,pubgpu)
JOBID PARTITION PENDING PRIORITY TRES_ALLOC|REASON
4650727 rtx6000 47970 0.00367972 cpu=5,mem=400G,node=1,gpu=1|Priority
4650727 rtx8000 47970 0.00367972 cpu=5,mem=400G,node=1,gpu=1|Priority
4650727 pub
On a single Rocky8 workstation with one GPU where we wanted ssh
interactive logins to it to have a small portion of its resources (shell,
compiling, simple data manipulations, console desktop, etc) and the rest
for SLURM, we did this:
- Set it to use cgroupv2
  * modify /etc/default/grub to
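The excerpt stops here; on Rocky 8, which boots with cgroup v1 by default, the switch to cgroup v2 is usually done roughly like this (a sketch, not necessarily the poster's exact steps):

  # sketch: enable the unified cgroup v2 hierarchy
  # add systemd.unified_cgroup_hierarchy=1 to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
  grub2-mkconfig -o /boot/grub2/grub.cfg   # (EFI systems use the grub.cfg under /boot/efi)
  reboot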
We do it by assigning fairshare just to the account.
sacctmgr -i add account $acct Description="$descr" \
Organization="$org" fairshare=200 GrpJobsAccrue=8
sacctmgr -i add user "$u" DefaultAccount=$acct fairshare=parent
Add the same user to other accounts with
sacctmgr -i add us
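The last command is cut off; presumably it follows the same pattern, i.e. something like the sketch below (the second account variable is a guess):

  sacctmgr -i add user "$u" Account=$other_acct fairshare=parent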
Alternatively, consider setting EnforcePartLimits=ALL in slurm.conf
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
Will using the "End=now" option with sreport not exclude the still-pending
array jobs while including data for the completed ones?
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Mon, 4 Mar 2024 5:18pm, Chip Seraphine via slurm-users wrote:
That's essen
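For context, a typical invocation of the kind being discussed would look like this (report type, start date, and units are illustrative assumptions):

  sreport cluster AccountUtilizationByUser Start=2024-02-01 End=now -t hours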
What do you mean "operate via the normal command line"? When
you salloc, you are still on the login node.
$ salloc -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G \
    --time=1-10:00:00 --gpus=2 --cpus-per-task=2 /bin/bash
salloc: Pending job allocation 3798364
salloc: job 3798364 queued
The Prolog will run with every job, not just "as asked for" by the user.
Also, it runs as the root or slurm user, not the user who submitted.
For that one would use TaskProlog, but at that point there is, I think,
no way to abort or requeue the job from TaskProlog.
The Prolog script could check for e
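The excerpt is cut off, but for illustration: a Prolog can abort a job simply by exiting non-zero, which drains the node and requeues the job in a held state. A minimal sketch, where the actual check is a made-up placeholder:

  #!/bin/bash
  # Prolog sketch: non-zero exit drains the node and requeues the job
  if [ ! -d /some/required/path ]; then    # hypothetical precondition
      echo "precondition missing on $(hostname)" >&2
      exit 1
  fi
  exit 0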
After using just Fairshare for over a year on our GPU cluster, we
have decided it is not achieving what we really want among our groups.
We have decided to look at preemption.
What we want is for users to NOT have a #job/GPU maximum (if they are the
only person on the cluster t
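For reference, QOS-based preemption of the kind being considered is usually wired up along these lines in slurm.conf plus sacctmgr (names and modes below are illustrative assumptions, not this site's config):

  # slurm.conf (sketch)
  PreemptType=preempt/qos
  PreemptMode=REQUEUE

  # sacctmgr (sketch): jobs under the "high" QOS may preempt jobs under "low"
  sacctmgr add qos low
  sacctmgr add qos high
  sacctmgr modify qos high set Preempt=low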
Several jobs have run on the node with the upgraded NVIDIA driver and
have gone through fine.
This issue may be a bug in the NVML library itself for the
nvidia-driver-NVML-535.54.03-1.el8.x86_64 driver or in SLURM 23.11's new
NVML code that fetches GPU smUtil (_get_gpuutil) in gpu_nvml.c
When I