[slurm-users] Re: pam_slurm_adopt and multiple jobs on the same worker node

2025-04-14 Thread Paul Raines via slurm-users
Instead of using pam_slurm_adopt, your users can get a shell on the node of a specific job, in that job's "mapped" space, by running srun --pty --overlap --jobid JOBIDNUM bash -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Mon, 14 Apr 2025 4:30am, Massimo Sgaravatto via slurm-users wro
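For example, to attach a shell to running job 12345 (the job ID here is a placeholder):

    srun --pty --overlap --jobid 12345 bash

The shell lands inside the job's existing allocation, so the job's cgroup limits (CPU, memory, GPU visibility) apply to it.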

[slurm-users] Re: cpus and gpus partitions and how to optimize the resource usage

2025-03-31 Thread Paul Raines via slurm-users
What I have done is set up partition QOSes for nodes with 4 GPUs and 64 cores sacctmgr add qos lcncpu-part sacctmgr modify qos lcncpu-part set priority=20 \ flags=DenyOnLimit MaxTRESPerNode=cpu=32,gres/gpu=0 sacctmgr add qos lcngpu-part sacctmgr modify qos lcngpu-part set priority=20 \ flag
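A sketch of the QOS pair being described; the GPU-side limits are an assumption, since the preview is truncated before them:

    # CPU QOS: half the cores, no GPUs (from the post)
    sacctmgr add qos lcncpu-part
    sacctmgr modify qos lcncpu-part set priority=20 \
        flags=DenyOnLimit MaxTRESPerNode=cpu=32,gres/gpu=0
    # GPU QOS: assumed mirror-image limits
    sacctmgr add qos lcngpu-part
    sacctmgr modify qos lcngpu-part set priority=20 \
        flags=DenyOnLimit MaxTRESPerNode=cpu=32,gres/gpu=4

These QOSes would then be attached to the partitions with QOS= in slurm.conf.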

[slurm-users] making a maint reservation on a specific GPU

2024-11-22 Thread Paul Raines via slurm-users
We have an 8-GPU server in which one GPU has gone into an error state that will require a reboot to clear. I have jobs on the server running on good GPUs that will take another 3 days to complete. In the meantime, I would like short jobs to run on the good free GPUs till I reboot. I set a r
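One way to express this (an assumption about the poster's setup; the name, node, and exact TRES syntax are placeholders) is a maint reservation that holds back one GPU's worth of TRES rather than whole nodes:

    scontrol create reservation ReservationName=badgpu \
        StartTime=now Duration=infinite Users=root Flags=maint \
        Nodes=gpu01 TRES=gres/gpu=1

Note that a reservation can hold back a count of GPUs on the node, not a specific GPU index, which is the crux of the thread.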

[slurm-users] Re: Randomly draining nodes

2024-10-22 Thread Paul Raines via slurm-users
I have a cron job that emails me when hosts go into drain mode and tells me the reason (scontrol show node=$host | grep -i reason). We get drains with the "Kill task failed" reason probably about 5 times a day, despite having UnkillableStepTimeout=180. Right now we are still handling the
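A minimal sketch of such a cron check (the mail recipient and script details are assumptions):

    #!/bin/bash
    # report drained nodes and their reasons
    out=$(for host in $(sinfo -h -t drain -o '%n' | sort -u); do
        scontrol show node=$host | grep -i reason
    done)
    [ -n "$out" ] && echo "$out" | mail -s 'drained nodes' root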

[slurm-users] Re: [EXTERN] How do you guys track which GPU is used by which job ?

2024-10-17 Thread Paul Raines via slurm-users
We do the same thing. Our prolog has == # setup DCGMI job stats if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then if [ -d /var/slurm/gpu_stats.run ] ; then if pgrep -f nv-hostengine >/dev/null 2>&1 ; then groupstr=$(/usr/bin/dcgmi group -c J$SLURM_JOB_ID -a $CUDA_VISIBLE_DEVICES) grou
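Reflowed, the quoted prolog fragment looks like the following; everything after the truncation point (the group-ID parsing and the stats calls) is an assumption based on the usual DCGM job-stats workflow:

    # setup DCGMI job stats
    if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then
      if [ -d /var/slurm/gpu_stats.run ] ; then
        if pgrep -f nv-hostengine >/dev/null 2>&1 ; then
          groupstr=$(/usr/bin/dcgmi group -c J$SLURM_JOB_ID -a $CUDA_VISIBLE_DEVICES)
          groupid=$(echo "$groupstr" | awk '{print $NF}')       # assumed parsing of the group ID
          /usr/bin/dcgmi stats -g "$groupid" -e                 # assumed: enable stats on the group
          /usr/bin/dcgmi stats -g "$groupid" -s "$SLURM_JOB_ID" # assumed: start job-stats recording
        fi
      fi
    fi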

[slurm-users] Re: Why AllowAccounts not work in slurm-23.11.6

2024-10-17 Thread Paul Raines via slurm-users
I am using Slurm 23.11.3 and AllowAccounts works for me. We have a partition defined with AllowAccounts, and if one tries to submit under an account not in the list one will get srun: error: Unable to allocate resources: Invalid account or account/partition combination specified Do you have
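For reference, the slurm.conf side looks like this (partition, node, and account names are placeholders):

    PartitionName=restricted Nodes=node[01-04] AllowAccounts=labA,labB State=UP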

[slurm-users] Re: Max TRES per user and node

2024-09-25 Thread Paul Raines via slurm-users
I am pretty sure there is no way to do exactly a per-user, per-node limit in SLURM. I cannot think of a good reason why one would do this. Can you explain? I don't see why it matters if you have two users submitting two 200G jobs if the jobs for the users are spread out over two nodes rather t
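The closest existing knob (shown here as a sketch, not a per-node limit) caps a user's total TRES under a QOS:

    sacctmgr modify qos normal set MaxTRESPerUser=mem=400G

This limits the user across the whole partition, regardless of how the jobs spread over nodes.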

[slurm-users] Re: FairShare if there's only one account?

2024-08-09 Thread Paul Raines via slurm-users
I have never used Slurm where I have not added users explicitly first, so I am not sure what happens in that case. But from your sshare output it certainly seems to default to fairshare=parent. Try modifying the users with sacctmgr modify user $username fairshare=200 and then run sshare -a -A
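Spelled out, with a placeholder username and account:

    sacctmgr modify user where name=alice set fairshare=200
    sshare -a -A myaccount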

[slurm-users] Re: FairShare if there's only one account?

2024-08-09 Thread Paul Raines via slurm-users
This depends on how you have assigned fairshare in sacctmgr when creating the accounts and users. At our site we want fairshare only on accounts and not users, just like you are seeing, so we create accounts with sacctmgr -i add account $acct Description="$descr" \ fairshare=200 GrpJ
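Reflowed, the creation pattern reads as follows (the Grp... limit is truncated in the preview and left out here; the user-side step is the one quoted in the FairShare priority thread below):

    sacctmgr -i add account $acct Description="$descr" fairshare=200
    sacctmgr -i add user "$u" DefaultAccount=$acct fairshare=parent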

[slurm-users] Re: Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Paul Raines via slurm-users
From: Paul Raines via slurm-users Sent: Tuesday, July 9, 2024 9:24 AM To: slurm-users Subject: [slurm-users] Job submitted to multiple partitions not running when any partition is full I have a job 465072 submitted to multiple partitions (rtx6000,rtx8000,pubgpu) JOBID PARTITIO

[slurm-users] Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Paul Raines via slurm-users
I have a job 465072 submitted to multiple partitions (rtx6000,rtx8000,pubgpu)

    JOBID    PARTITION  PENDING  PRIORITY    TRES_ALLOC|REASON
    4650727  rtx6000    47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
    4650727  rtx8000    47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
    4650727  pub
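One way to reproduce columns like these (an assumption, not necessarily the poster's exact command):

    squeue -j 4650727 -O JobID,Partition,PendingTime,Priority,tres-alloc,Reason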

[slurm-users] Re: Reserving resources for use by non-slurm stuff

2024-04-17 Thread Paul Raines via slurm-users
On a single Rocky8 workstation with one GPU, where we wanted ssh interactive logins to have a small portion of its resources (shell, compiling, simple data manipulations, console desktop, etc.) and the rest reserved for SLURM, we did this:
- Set it to use cgroupv2
  * modify /etc/default/grub to
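The grub step presumably looks like this on EL8 (the exact kernel flags are an assumption; the preview truncates before them):

    # /etc/default/grub: append to GRUB_CMDLINE_LINUX
    GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=1"
    # then regenerate the config (path shown is for BIOS boot) and reboot:
    grub2-mkconfig -o /boot/grub2/grub.cfg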

[slurm-users] Re: FairShare priority questions

2024-03-27 Thread Paul Raines via slurm-users
We do it by assigning fairshare just to the account. sacctmgr -i add account $acct Description="$descr" \ Organization="$org" fairshare=200 GrpJobsAccrue=8 sacctmgr -i add user "$u" DefaultAccount=$acct fairshare=parent Add the same user to other accounts with sacctmgr -i add us
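Completing the truncated pattern at the end (the second account name is a placeholder):

    sacctmgr -i add user "$u" Account=otheracct fairshare=parent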

[slurm-users] Re: Lua script

2024-03-06 Thread Paul Raines via slurm-users
Alternatively, consider setting EnforcePartLimits=ALL in slurm.conf -- Paul Raines (http://help.nmr.mgh.harvard.edu)
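That is a one-line slurm.conf change:

    EnforcePartLimits=ALL

With ALL, a job submitted to multiple partitions is rejected at submit time unless it satisfies the limits of every partition listed.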

[slurm-users] Re: Slurm billback and sreport

2024-03-05 Thread Paul Raines via slurm-users
Will using the option "End=now" with sreport not exclude the still-pending array jobs while including data for the completed ones? -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Mon, 4 Mar 2024 5:18pm, Chip Seraphine via slurm-users wrote: That's essen
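An example of the form under discussion (the dates and report type are placeholders):

    sreport -t Hours cluster AccountUtilizationByUser Start=2024-02-01 End=now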

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Raines via slurm-users
What do you mean by "operate via the normal command line"? When you salloc, you are still on the login node.

    $ salloc -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G \
        --time=1-10:00:00 --gpus=2 --cpus-per-task=2 /bin/bash
    salloc: Pending job allocation 3798364
    salloc: job 3798364 queued
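One common way to actually land on the allocated node is to run, inside that salloc shell:

    srun --pty bash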

[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Paul Raines via slurm-users
The Prolog will run with every job, not just "as asked for" by the user. Also, it runs as the root or slurm user, not the user who submitted. For that one would use TaskProlog, but at that point there is no way, I think, to abort or requeue the job from TaskProlog. The Prolog script could check for e
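A sketch of the kind of Prolog check described (the test itself is a placeholder); a nonzero exit from Prolog drains the node and requeues a batch job:

    #!/bin/bash
    # Prolog: runs as root/slurm for every job on the node
    if ! grep -q /required/mount /proc/mounts ; then
        exit 1
    fi
    exit 0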

[slurm-users] scheme for protected GPU jobs from preemption

2024-02-06 Thread Paul Raines via slurm-users
After using just Fairshare for over a year on our GPU cluster, we have decided it is not working for us for what we really want to achieve among our groups. We have decided to look at preemption. What we want is for users to NOT have a #job/GPU maximum (if they are the only person on the cluster t
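A hedged sketch of a QOS-based preemption scheme of this general shape (the names and modes are assumptions, not the poster's final design):

    # slurm.conf:
    #   PreemptType=preempt/qos
    #   PreemptMode=REQUEUE
    sacctmgr add qos protected
    sacctmgr add qos standard
    sacctmgr modify qos protected set Priority=100 Preempt=standard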

[slurm-users] Re: after upgrade to 23.11.1 nodes stuck in completion state

2024-02-01 Thread Paul Raines via slurm-users
Several jobs have run on the node with the upgraded NVIDIA driver and have gone through fine. This issue may be a bug in the NVML library itself for the nvidia-driver-NVML-535.54.03-1.el8.x86_64 driver, or in SLURM 23.11's new NVML code that fetches GPU smUtil (_get_gpuutil) in gpu_nvml.c. When I