[slurm-users] Re: pam_slurm_adopt and multiple jobs on the same worker node

2025-04-14 Thread Paul Raines via slurm-users
Instead of using pam_slurm_adopt your users can get a shell on the node of a specific job in that job's "mapped" space by running srun --pty --overlap --jobid JOBIDNUM bash -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Mon, 14 Apr 2025 4:30am, Massimo Sgaravatto
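For example, assuming a running job with ID 123456 on a cluster that confines jobs with task/cgroup (a sketch, not tied to any particular site):

   $ squeue -u $USER                              # find the job ID of the running job
   $ srun --pty --overlap --jobid 123456 bash     # open a shell inside that job's allocation

The shell lands in the job's cgroup, so it only sees the CPUs, memory, and GPUs that job was allocated.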

[slurm-users] Re: cpus and gpus partitions and how to optimize the resource usage

2025-03-31 Thread Paul Raines via slurm-users
What I have done is set up partition QOSes for nodes with 4 GPUs and 64 cores: sacctmgr add qos lcncpu-part sacctmgr modify qos lcncpu-part set priority=20 \ flags=DenyOnLimit MaxTRESPerNode=cpu=32,gres/gpu=0 sacctmgr add qos lcngpu-part sacctmgr modify qos lcngpu-part set priority=20 \ flag
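A minimal sketch of how such a partition QOS can be wired together (the QOS, node, and partition names are made up; the MaxTRESPerNode values follow the 32-core/0-GPU split described above):

   sacctmgr add qos lcncpu-part
   sacctmgr modify qos lcncpu-part set flags=DenyOnLimit \
       MaxTRESPerNode=cpu=32,gres/gpu=0
   # slurm.conf: attach the QOS to the CPU-only partition
   PartitionName=lcncpu Nodes=lcn[01-04] QOS=lcncpu-part State=UP

With flags=DenyOnLimit, jobs that would exceed the per-node TRES limits are rejected at submission instead of being left pending.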

[slurm-users] making a maint reservation on a specific GPU

2024-11-22 Thread Paul Raines via slurm-users
) from now till 2024-11-25T06:00:00 so no job runs that will use it. Is that possible? --- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging 149 (2301) 13th Street

[slurm-users] Re: Randomly draining nodes

2024-10-22 Thread Paul Raines via slurm-users
nfs kernel process finally exit. In the rare case we can not find a way to kill the unkillable process we arrange to reboot the node. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Tue, 22 Oct 2024 12:59am, Christopher Samuel via slurm-users wrote: External Email - Use Caution On

[slurm-users] Re: [EXTERN] How do you guys track which GPU is used by which job ?

2024-10-17 Thread Paul Raines via slurm-users
We do the same thing. Our prolog has == # setup DCGMI job stats if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then if [ -d /var/slurm/gpu_stats.run ] ; then if pgrep -f nv-hostengine >/dev/null 2>&1 ; then groupstr=$(/usr/bin/dcgmi group -c J$SLURM_JOB_ID -a $CUDA_VISIBLE_DEVICES) grou
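A rough sketch of that kind of prolog fragment (DCGM must be running via nv-hostengine, and exact dcgmi flags can vary between DCGM versions, so treat this as an outline rather than a drop-in script):

   # slurm prolog fragment: per-job GPU accounting with DCGM
   if [ -n "$CUDA_VISIBLE_DEVICES" ] && pgrep -f nv-hostengine >/dev/null 2>&1 ; then
       # create a GPU group named after the job, containing the job's GPUs
       groupstr=$(/usr/bin/dcgmi group -c "J$SLURM_JOB_ID" -a "$CUDA_VISIBLE_DEVICES")
       groupid=$(echo "$groupstr" | grep -o '[0-9]\+' | tail -1)
       # enable stats on the group and start recording under the job ID
       /usr/bin/dcgmi stats -g "$groupid" -e
       /usr/bin/dcgmi stats -g "$groupid" -s "$SLURM_JOB_ID"
   fi

A matching epilog would stop and dump the job stats (dcgmi stats -x / -j) and remove the group.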

[slurm-users] Re: Why AllowAccounts not work in slurm-23.11.6

2024-10-17 Thread Paul Raines via slurm-users
I am using Slurm 23.11.3 and AllowAccounts works for me. We have a partition defined with AllowAccounts, and if one tries to submit under an account not in the list one will get: srun: error: Unable to allocate resources: Invalid account or account/partition combination specified Do you have
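For reference, the kind of partition definition being described (names invented):

   PartitionName=lcn Nodes=lcn[01-08] AllowAccounts=lablab,neuro State=UP

Jobs submitted with -A/--account set to anything not in that list get the "Invalid account or account/partition combination" error quoted above.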

[slurm-users] Re: Max TRES per user and node

2024-09-25 Thread Paul Raines via slurm-users
e are definitely ways to do that using MemSpecLimit on the node. You can even apportion CPU cores using CpuSpecList and various cgroups v2 settings at the OS level. Otherwise there may be a way with some fancy scripting in a LUA submit plugin or playing around with the Feature/Helper plugin -- Paul Raines
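A sketch of the node-level settings mentioned (values are purely illustrative):

   # reserve 16 GB of RAM and abstract CPU IDs 0-3 for the OS and daemons;
   # jobs on the node then see correspondingly reduced CfgTRES
   NodeName=gpu[01-04] CPUs=64 RealMemory=515000 MemSpecLimit=16384 CpuSpecList=0-3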

[slurm-users] Re: FairShare if there's only one account?

2024-08-09 Thread Paul Raines via slurm-users
mic to see what has changed. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Fri, 9 Aug 2024 5:57pm, Drucker, Daniel wrote: Hi Paul from over at mclean.harvard.edu<http://mclean.harvard.edu>! I have never added any users using sacctmgr - I've always just had everyo

[slurm-users] Re: FairShare if there's only one account?

2024-08-09 Thread Paul Raines via slurm-users
rshare=200 GrpJobsAccrue=8 and users with sacctmgr -i add user "$u" account=$acct fairshare=parent If you want users to have their own independent fairshare, you do not use fairshare=parent but assign a real number. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Fri, 9 Aug
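Putting those two commands together, a sketch of the per-lab layout being described (account and user names invented):

   sacctmgr -i add account lablab fairshare=200 GrpJobsAccrue=8
   sacctmgr -i add user alice account=lablab fairshare=parent   # shares the account's fairshare
   sacctmgr -i add user bob account=lablab fairshare=100        # or an independent per-user share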

[slurm-users] Re: Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Paul Raines via slurm-users
Thanks. I traced it to a MaxMemPerCPU=16384 setting on the pubgpu partition. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Tue, 9 Jul 2024 2:39pm, Timony, Mick wrote: External Email - Use Caution Hi Paul, There could be multiple reasons why the job isn't running,

[slurm-users] Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Paul Raines via slurm-users
? --- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging 149 (2301) 13th Street Charlestown, MA 02129 USA

[slurm-users] Re: Reserving resources for use by non-slurm stuff

2024-04-17 Thread Paul Raines via slurm-users
oo and that causes weird behavior with a lot of system tools. So far the root/daemon processes work fine within the 20GB limit, so that MemoryHigh=20480M is one and done. Then reboot. -- Paul Raines (http://help.nmr.mgh.harvard.edu)
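The MemoryHigh= referred to is a systemd cgroup property (cgroup v2); one way to apply it to the slice holding the system daemons is (slice name and value are examples):

   systemctl set-property system.slice MemoryHigh=20480M

set-property persists the setting as a drop-in, and a reboot (as noted above) makes it take effect cleanly for everything already running.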

[slurm-users] Re: FairShare priority questions

2024-03-27 Thread Paul Raines via slurm-users
counts with sacctmgr -i add user "$u" account=$acct fairshare=parent -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Wed, 27 Mar 2024 9:22am, Long, Daniel S. via slurm-users wrote: External Email - Use Caution Hi, I’m trying to set up multifactor priority on our cluste

[slurm-users] Re: Lua script

2024-03-06 Thread Paul Raines via slurm-users
Alternatively consider setting EnforcePartLimits=ALL in slurm.conf -- Paul Raines (http://help.nmr.mgh.harvard.edu)

[slurm-users] Re: Slurm billback and sreport

2024-03-05 Thread Paul Raines via slurm-users
Will using option "End=now" with sreport not exclude the still pending array jobs while including data for the completed ones? -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Mon, 4 Mar 2024 5:18pm, Chip Seraphine via slurm-users wrote: External Email - Use Cauti
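For context, the style of sreport query in question (dates are placeholders):

   sreport cluster AccountUtilizationByUser Start=2024-02-01 End=now -t Hours

sreport only counts usage that falls inside the Start/End window, so job records with no runtime in that window contribute nothing.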

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Raines via slurm-users
tname mlsc-login.nmr.mgh.harvard.edu mlsc-login[0]:~$ printenv | grep SLURM_JOB_NODELIST SLURM_JOB_NODELIST=rtx-02 Seems you MUST use srun -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Wed, 28 Feb 2024 10:25am, Paul Edmon via slurm-users wrote: External Email - Use Caution salloc is the currently recom

[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Paul Raines via slurm-users
s any good way to do this with safe requeuing. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Wed, 14 Feb 2024 9:32am, Paul Edmon via slurm-users wrote: External Email - Use Caution You probably want the Prolog option: https://secure-web.cisco.com/1gA

[slurm-users] scheme for protected GPU jobs from preemption

2024-02-06 Thread Paul Raines via slurm-users
a complicated cron job that tries to do it all "outside" of SLURM issuing scontrol commands. ------- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging 149 (

[slurm-users] Re: after upgrade to 23.11.1 nodes stuck in completion state

2024-02-01 Thread Paul Raines via slurm-users
/free caused by the bug in the gpu_nvml.c code. So it is not truly clear where the underlying issue really is, but it seems most likely a bug in the older version of NVML I had installed. Ideally, though, SLURM would have better handling of slurmstepd processes crashing. -- Pa

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines
le of tests the problem has gone away. Going to need to have real users test with real jobs. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Tue, 30 Jan 2024 9:01am, Paul Raines wrote: External Email - Use Caution I built 23.02.7 and tried that and had the same problems. BTW, I am u

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines
derlying issue here and I wonder if the NVML library at the time of build is the key (though like I said I tried rebuilding with NVIDIA 470 and that still had the issue) -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Tue, 30 Jan 2024 3:36am, Fokke Dijkstra wrote: External Email -

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-28 Thread Paul Raines
28T17:33:58.771] debug: completed epilog for jobid 3679888 [2024-01-28T17:33:58.774] debug: JobId=3679888: sent epilog complete msg: rc = 0 -- Paul Raines (http://help.nmr.mgh.harvard.edu)

[slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-28 Thread Paul Raines
p' state that claims to have jobs completing that are definitely NOT running on the box, BUT there are jobs running on the box that SLURM thinks are done --- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS

Re: [slurm-users] CPUSpecList confusion

2022-12-15 Thread Paul Raines
Turns out on that new node I was running hwloc in a cgroup restricted to cores 0-13 so that was causing the issue. In an unrestricted cgroup shell, "hwloc-ls --only pu" works properly and gives me the correct SLURM mapping. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On T

Re: [slurm-users] CPUSpecList confusion

2022-12-15 Thread Paul Raines
. This command does work on all my other boxes so I do think using hwloc-ls is the "best" answer for getting the mapping on most hardware out there. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Thu, 15 Dec 2022 1:24am, Marcus Wagner wrote: Hi Paul, as Slurm uses hwloc, I was

Re: [slurm-users] CPUSpecList confusion

2022-12-14 Thread Paul Raines
U_IDs=3-6 3-6 On Wed, 14 Dec 2022 9:42am, Paul Raines wrote: Yes, I see that on some of my other machines too. So apicid is definitely not what SLURM is using but somehow just lines up that way on this one machine I have. I think the issue is cgroups counts starting at 0 all the cores on t

Re: [slurm-users] CPUSpecList confusion

2022-12-14 Thread Paul Raines
fo the apicid for processor 12 is 16 # scontrol -d show job 1967214 | grep CPU_ID Nodes=r17 CPU_IDs=8-11,20-23 Mem=51200 GRES= # cat /sys/fs/cgroup/cpuset/slurm/uid_5164679/job_1967214/cpuset.cpus 16-23 I am totally lost now. Seems totally random. SLURM devs? Any insight? -- Paul Rain

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
-- Paul Raines (http://help.nmr.mgh.harvard.edu) On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote: External Email - Use Caution In the slurm.conf manual they state the CpuSpecList ids are "abstract", and in the CPU management docs they enforce the notion that the abstract Slurm I

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
Hmm. Actually looks like confusion between CPU IDs on the system and what SLURM thinks the IDs are # scontrol -d show job 8 ... Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES= ... # cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective 7-10,39-42 -- Paul Raines

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
Oh but that does explain the CfgTRES=cpu=14. With the CpuSpecList below and SlurmdOffSpec I do get CfgTRES=cpu=50 so that makes sense. The issue remains that though the number of cpus in CpuSpecList is taken into account, the exact IDs seem to be ignored. -- Paul Raines (http

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
--mem=25G \ --time=10:00:00 --cpus-per-task=8 --pty /bin/bash $ grep -i ^cpu /proc/self/status Cpus_allowed: 0780,0780 Cpus_allowed_list: 7-10,39-42 -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Mon, 12 Dec 2022 10:21am, Sean Maxwell wrote: Hi Paul, Nodename=foobar

[slurm-users] CPUSpecList confusion

2022-12-09 Thread Paul Raines
. --- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging 149 (2301) 13th Street Charlestown, MA 02129 USA

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-21 Thread Paul Raines
Almost all the 5 min+ time was in the bzip2. The mysqldump by itself was about 16 seconds. So I moved the bzip2 to its own separate line so the tables are only locked for the ~16 seconds -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote
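In other words, something along these lines in the backup script (paths are examples; slurm_acct_db is the default slurmdbd database name):

   # dump first (tables are locked only for the ~16 seconds the dump takes), then compress separately
   mysqldump slurm_acct_db > /backup/slurm_acct_db.sql
   bzip2 /backup/slurm_acct_db.sql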

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-20 Thread Paul Raines
rework it. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Mon, 19 Sep 2022 9:29am, Reed Dier wrote: I’m not sure if this might be helpful, but my logrotate.d for slurm looks a bit different; namely, instead of a systemctl reload, I am sending a specific SIGUSR2 signal, which is
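A sketch of that SIGUSR2 variant for /etc/logrotate.d/slurm (paths are examples; Slurm's daemons reopen their log files on SIGUSR2, so no reload or restart is needed):

   /var/log/slurm/slurmctld.log {
       weekly
       rotate 8
       compress
       missingok
       postrotate
           pkill -x -USR2 slurmctld
       endscript
   }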

[slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Paul Raines
ob with srun/salloc and not a job that has been running for days. Is it InactiveLimit that leads to the "inactivity time limit reached" message? Anyway, I have changed InactiveLimit=600 to see if that helps. ------- Paul Raines

Re: [slurm-users] Strange memory limit behavior with --mem-per-gpu

2022-04-08 Thread Paul Raines
Sorry, should have stated that before. I am running Slurm 20.11.3 on CentOS 8 Stream that I compiled myself back in June 2021. I will try to arrange an upgrade in the next few weeks. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Fri, 8 Apr 2022 4:02am, Bjørn-Helge Mevik wrote: Paul

Re: [slurm-users] Strange memory limit behavior with --mem-per-gpu

2022-04-07 Thread Paul Raines
[0]:~$ cat /sys/fs/cgroup/memory/slurm/uid_5829/job_1134068/memory.limit_in_bytes 8589934592 On Wed, 6 Apr 2022 3:30pm, Paul Raines wrote: I have a user who submitted an interactive srun job using: srun --mem-per-gpu 64 --gpus 1 --nodes 1 From sacct for this job we see

[slurm-users] Strange memory limit behavior with --mem-per-gpu

2022-04-06 Thread Paul Raines
all the memory. ------- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging 149 (2301) 13th Street Charlestown, MA 02129 USA

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-03 Thread Paul Raines
On Thu, 3 Feb 2022 1:30am, Stephan Roth wrote: On 02.02.22 18:32, Michael Di Domenico wrote: On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote: The problem is to identify the cards physically from the information we have, like what's reported with nvidia-smi or available in /proc/drive

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-01 Thread Paul Raines
antee it or are there instances where it would be ignored there? -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Tue, 1 Feb 2022 3:09am, EPF (Esben Peter Friis) wrote: The numbering seen from nvidia-smi is not necessarily the same as the order of /dev/nvidiaXX. There is a way to force that

[slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-30 Thread Paul Raines
(truncated nvidia-smi process listing: python processes using ~10849 MiB each on the GPUs) How can I make SLURM not use GPU 2 and 4? ----
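One way to do that is to list only the wanted device files explicitly in gres.conf instead of autodetecting, so Slurm never schedules the other two (node name and device numbering are illustrative, and the Gres= count in slurm.conf has to match):

   # gres.conf: expose GPUs 0, 1 and 3 only; /dev/nvidia2 and /dev/nvidia4 are left out
   NodeName=gpunode01 Name=gpu File=/dev/nvidia0
   NodeName=gpunode01 Name=gpu File=/dev/nvidia1
   NodeName=gpunode01 Name=gpu File=/dev/nvidia3

As the follow-up replies note, the index shown by nvidia-smi is not necessarily the same as the /dev/nvidiaXX numbering, so the right device files need to be identified first.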

Re: [slurm-users] Calculate the GPU usages

2021-09-01 Thread Paul Raines
--- mlsc ***jl1103 *** gres/gpu 390 In slurm.conf for the partition all these jobs ran on I have TRESBillingWeights="CPU=1.24,Mem=0.02G,Gres/gpu=3.0" if that affects the sreport number somehow -- but then I would expect sreport's number to simply be 3x t

Re: [slurm-users] Restrict user not use node without GPU

2021-08-17 Thread Paul Raines
Then set JobSubmitPlugins=lua in slurm.conf. I cannot find any documentation about what really should be in tres_per_job and tres_per_node, as I would expect the cpu and memory requests in there, but it is still "nil" even when those are given. For our cluster I have only seen it non

Re: [slurm-users] Information about finished jobs

2021-06-14 Thread Paul Raines
ime but it is way less (11 minutes instead of nearly 9 hours) # /usr/bin/sstat -p -a --job=357305 --format=JobID,AveCPU JobID|AveCPU| 357305.extern|213503982334-14:25:51| 357305.batch|11:33.000| Any idea why this is? Also, what is that crazy number for AveCPU on 357305.extern? -- Paul Ra

Re: [slurm-users] [EXT] rejecting jobs that exceed QOS limits

2021-05-29 Thread Paul Raines
Ah, should have found that. Thanks. On Sat, 29 May 2021 12:08am, Sean Crosby wrote: Hi Paul, Try sacctmgr modify qos gputest set flags=DenyOnLimit Sean From: slurm-users on behalf of Paul Raines Sent: Saturday, 29 May 2021 12:48 To: slurm-users

[slurm-users] rejecting jobs that exceed QOS limits

2021-05-28 Thread Paul Raines
I want to dedicate one of our GPU servers for testing where users are only allowed to run 1 job at a time using 1 GPU and 8 cores of the server. So I put one server in a partition on its own and set up a QOS for it as follows: sacctmgr add qos gputest sacctmgr modify qos gputest set priority=
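A hedged sketch of the rest of that setup, with limits matching the description (one running job, 1 GPU and 8 cores per user; the partition line is illustrative, and DenyOnLimit is the flag suggested in the reply above):

   sacctmgr add qos gputest
   sacctmgr modify qos gputest set flags=DenyOnLimit \
       MaxJobsPerUser=1 MaxTRESPerUser=cpu=8,gres/gpu=1
   # slurm.conf: tie the QOS to the single-node test partition
   PartitionName=gputest Nodes=gputest01 QOS=gputest State=UP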

Re: [slurm-users] unable to create directory '/run/user/14325/dconf': Permission denied. dconf will not work properly.

2021-03-17 Thread Paul Raines
This is most likely because your XDG* environment variables are being copied into the job environment. We do the following in our taskprolog script echo "unset XDG_RUNTIME_DIR" echo "unset XDG_SESSION_ID" echo "unset XDG_DATA_DIRS" -- Paul Raines (http://h

Re: [slurm-users] slurm bank and sreport tres minute usage problem

2021-03-12 Thread Paul Raines
lling=18,cpu=4,gres/gpu=1,mem=512G,node=1 201124.0 2021-03-09T14:55:29 2021-03-10T08:13:07 17:17:38 cpu=4,gres/gpu=1,mem=512G,node=1 So the first job used all 24 hours of that day, the 2nd just 3 seconds (so ignore it) and the third about 9 hours and 5 minutes CPU = 24*60*3+(9*60+5)*4 = 6500

[slurm-users] cgroup clean up after "Kill task failed"

2021-02-16 Thread Paul Raines
"job_100418"'s in /sys/fs/cgroup without rebooting? ------- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging 149 (2301) 13th Street Charlestown, MA 02129 USA

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Raines
le on the node. This also probably requires you to have ProctrackType=proctrack/cgroup TaskPlugin=task/affinity,task/cgroup GresTypes=gpu like I do -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Tue, 26 Jan 2021 3:40pm, Ole Holm Nielsen wrote: Thanks Paul! On 26-01-2021 21:11, Paul

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-26 Thread Paul Raines
Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=32 Links=-1,0,0,0,0,0,2,0,0,0 This is fine with me as I want SLURM to ignore GPU affinity on these nodes but it is curious. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Mon, 25 Jan 2021 10:07am, Paul Raines wrote: I tried

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Raines
fault RPM SPEC is needed. I just run rpmbuild --tb slurm-20.11.3.tar.bz2. You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml' and see that /usr/lib64/slurm/gpu_nvml.so only exists on the one built on the GPU node. -- Paul Raines (http://help.nmr.mgh.harvard.edu) O

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-25 Thread Paul Raines
I tried submitting jobs with --gres-flags=disable-binding but this has not made any difference. Jobs asking for GPUs are still only being run if a core defined in gres.conf for the GPU is free. Basically it seems the option is ignored. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Sun

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-24 Thread Paul Raines
"affinity enforcement" as it is more important that a job run with a GPU on its non-affinity socket than just wait and not run at all? Thanks -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Sat, 23 Jan 2021 3:19pm, Chris Samuel wrote: On Saturday, 23 January 2021 9:54:11 AM PST Paul R

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-23 Thread Paul Raines
6.log Power= TresPerJob=gpu:1 MailUser=mu40 MailType=FAIL I don't see anything obvious here. Is it maybe the 7 day thing? If I submit my jobs for 7 days to the rtx6000 partition though I don't see the problem. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Thu, 21 Jan 2021 5

[slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-21 Thread Paul Raines
first,\ reduce_completing_frag,\ max_rpc_cnt=16 DependencyParameters=kill_invalid_depend So any idea why job 38687 is not being run on the rtx-06 node? ------- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS At

Re: [slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2020-10-29 Thread Paul Raines
ite } for pid=403840 comm="sshd" name="rtx-05_811.4294967295" dev="md122" ino=2228938 scontext=system_u:system_r:sshd_t:s0-s0:c0.c1023 tcontext=system_u:object_r:var_t:s0 tclass=sock_file permissive=1 -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Mon, 26 O

Re: [slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2020-10-26 Thread Paul Raines
: Cancelled pending job step with signal 2 srun: error: Unable to create step for job 808: Job/step already completing or completed. But it just hung forever till I did a ^C. thank -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Sat, 24 Oct 2020 3:43am, Juergen Salk wrote: Hi Paul, maybe

Re: [slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2020-10-26 Thread Paul Raines
shd[176647]: fatal: Access denied for user raines by PAM account configuration [preauth] -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Fri, 23 Oct 2020 11:12pm, Wensheng Deng wrote: Append ‘log_level=debug5’ to the pam_slurm_adopt line in system-auth, restart sshd, try a new job and ssh ses

[slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2020-10-23 Thread Paul Raines
I am running Slurm 20.02.3 on CentOS 7 systems. I have pam_slurm_adopt setup in /etc/pam.d/system-auth and slurm.conf has PrologFlags=Contain,X11 I also have masked systemd-logind But pam_slurm_adopt always denies login with "Access denied by pam_slurm_adopt: you have no active jobs on this n
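For reference, the usual shape of that setup (which pam.d file to edit varies by distribution, so this is only a sketch):

   # /etc/pam.d/system-auth (or sshd), near the end of the account stack:
   account    sufficient    pam_slurm_adopt.so
   # slurm.conf on the compute nodes:
   PrologFlags=Contain,X11

pam_slurm_adopt needs the job's "extern" step (created by PrologFlags=Contain) so it has a cgroup to adopt the ssh session into, which is why masking systemd-logind is also commonly recommended alongside it.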

Re: [slurm-users] Billing issue

2020-08-06 Thread Paul Raines
Bas Does that mean you are setting PriorityFlags=MAX_TRES ? Also does anyone understand this from the slurm.conf docs: The weighted amount of a resource can be adjusted by adding a suffix of K,M,G,T or P after the billing weight. For example, a memory weight of "mem=.25" on a job allocat

Re: [slurm-users] cgroup limits not created for jobs

2020-07-26 Thread Paul Raines
On Sat, 25 Jul 2020 2:00am, Chris Samuel wrote: On Friday, 24 July 2020 9:48:35 AM PDT Paul Raines wrote: But when I run a job on the node it runs I can find no evidence in cgroups of any limits being set Example job: mlscgpu1[0]:~$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1

[slurm-users] cgroup limits not created for jobs

2020-07-24 Thread Paul Raines
/freezer/tasks /sys/fs/cgroup/systemd/user.slice/user-5829.slice/session-80624.scope/tasks --- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging 149 (2301) 13th

Re: [slurm-users] GPU configuration not working

2020-07-23 Thread Paul Raines
PER_NODE=4 SLURM_SUBMIT_HOST=mlscgpu1 SLURM_JOB_PARTITION=batch SLURM_JOB_NUM_NODES=1 SLURM_MEM_PER_NODE=1024 mlscgpu1[0]:~$ But still no CUDA_VISIBLE_DEVICES is being set On Thu, 23 Jul 2020 10:32am, Paul Raines wrote: I have two systems in my cluster with GPUs. Their setup in slurm.conf is GresTypes=gp

[slurm-users] GPU configuration not working

2020-07-23 Thread Paul Raines
I have two systems in my cluster with GPUs. Their setup in slurm.conf is GresTypes=gpu NodeName=mlscgpu1 Gres=gpu:quadro_rtx_6000:10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557 NodeName=mlscgpu2 Gres=gpu:quadro_rtx_6000:5 CPUs=64 Boards=1 SocketsP
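Those NodeName lines are normally paired with matching GRES definitions in gres.conf on each node; a sketch assuming the device files are /dev/nvidia0-9:

   # gres.conf on mlscgpu1
   Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia[0-9]
   # or, if slurmd was built against NVML:
   AutoDetect=nvml

CUDA_VISIBLE_DEVICES is only set for jobs that actually request GPUs (e.g. --gres=gpu:quadro_rtx_6000:1) and only when Slurm knows which device files back the GRES.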

[slurm-users] memory limit questions

2018-07-13 Thread Paul Raines
looping and not cleanly failing? But if that was the case, why would slurmstepd see the step's memory limit exceeded, and if it did, why did it not kill the process? ------- Paul Raines http://help.nmr.mgh.harvard.edu MGH/MIT/HMS Athin