[slurm-users] Feature request: Max Jobs Per Minute

2024-09-09 Thread Ransom, Geoffrey M. via slurm-users
Hello. We have another batch of new users and some more batches of large array jobs with very short runtimes, due either to errors in the jobs or just by design. While trying to deal with these issues (setting ArrayTaskThrottle, plus user education), I had a thought that it would be very nice to have a limit
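A sketch of the existing throttling mechanisms this thread mentions; the job id, script name, and throttle values are illustrative:

```shell
# Submit-time throttle: the %N suffix limits how many tasks of the
# array may run simultaneously.
sbatch --array=1-10000%10 job.sh

# After-the-fact throttle on an already-queued array job:
scontrol update JobId=12345 ArrayTaskThrottle=10
```

Neither of these limits the *submission* rate, which is what the requested per-minute cap would add.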

[slurm-users] How do you make --export=NONE the default behavior for our cluster?

2022-06-03 Thread Ransom, Geoffrey M.
Hello. We recently added new architectures to our compute and submit nodes, and a PATH gets generated based on the type of machine our users log into. Unfortunately, this PATH is architecture-dependent, but it is getting copied with Slurm jobs to the compute nodes, which could be a different ar
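One way to get this behavior site-wide, assuming a profile.d-style login setup (the file path is illustrative): sbatch reads the `SBATCH_EXPORT` input environment variable as if it were `--export` on the command line, so exporting it for every shell makes `--export=NONE` the default while still letting users override it explicitly.

```shell
# /etc/profile.d/slurm_export.sh (hypothetical path)
# Make sbatch behave as if --export=NONE were always given, so the
# submit host's architecture-specific PATH is not copied into jobs.
# An explicit --export=... on the command line still overrides this.
export SBATCH_EXPORT=NONE
```

With `--export=NONE` the job still gets the SLURM_* variables and a fresh login environment generated on the compute node itself.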

Re: [slurm-users] [EXT] Re: slurmdbd full backup so the primary can be purged

2021-12-14 Thread Ransom, Geoffrey M.
hiving/purging turned off so it won't rearchive the data you restored. -Paul Edmon- On 12/10/2021 1:28 PM, Ransom, Geoffrey M. wrote: Hello Our slurmdbd database is getting rather large and affecting performance, but we want to keep usage data around for a few years for metric purposes i

[slurm-users] slurmdbd full backup so the primary can be purged

2021-12-10 Thread Ransom, Geoffrey M.
Hello Our slurmdbd database is getting rather large and affecting performance, but we want to keep usage data around for a few years for metric purposes in order to figure out how our users work. I read a suggestion to have a backup DB which has all the usage data synced to it for metric pur

Re: [slurm-users] A100 MIG in slurm

2021-04-22 Thread Ransom, Geoffrey M.
I just caught up on my slurm-users backlog and saw this was probably answered 12 hours before I sent it. Thanks past slurm users. Sorry for the unnecessary emails. From: slurm-users On Behalf Of Ransom, Geoffrey M. Sent: Wednesday, April 21, 2021 5:36 PM To: Slurm User Community List

[slurm-users] A100 MIG in slurm

2021-04-21 Thread Ransom, Geoffrey M.
Hello We tossed an A100 card set up as 7 MIG (Multi-Instance GPU) devices into slurm, but the device nodes that refer to each MIG instance are not immediately obvious. (16 /dev/nvidia-cap### devices were created.) Is anyone familiar with this and know how to set up MIG devices as cgroup-controlled TRES in s
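A minimal configuration sketch, assuming a Slurm release recent enough to auto-detect MIG instances and a slurmd built against NVML (node name and count are illustrative):

```
# gres.conf on the GPU node: let slurmd enumerate the MIG instances
# (and their /dev/nvidia-cap### nodes) itself via NVML.
AutoDetect=nvml

# slurm.conf fragment: advertise the seven MIG instances as GPUs.
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:7
```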

[slurm-users] How does cli_filter.lua tell if srun is inside an existing allocation?

2021-03-20 Thread Ransom, Geoffrey M.
Hello We are setting up a lua cli_filter to coerce users to set certain options for their jobs (tracked but non-accounted wckeys) and noticed that the cli_filter.lua script runs for both srun step allocations and job allocations. How should a cli_filter.lua script detect if it is inside a step or job allocat

Re: [slurm-users] slurm and device cgroups

2021-03-04 Thread Ransom, Geoffrey M.
From: slurm-users On Behalf Of Ransom, Geoffrey M. Sent: Thursday, March 4, 2021 4:20 PM To: Slurm User Community List Subject: [slurm-users] slurm and device cgroups

[slurm-users] slurm and device cgroups

2021-03-04 Thread Ransom, Geoffrey M.
Hello I am trying to debug an issue with EGL support (we updated NVIDIA drivers and now eglGetDisplay and eglQueryDevicesEXT fail if they can't access all /dev/nvidia# devices in slurm) and am wondering how slurm uses device cgroups so I can implement the same cgroup setup by hand and te
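A sketch of reproducing a Slurm-style device cgroup by hand, assuming cgroup v1 with the devices controller mounted (run as root; the cgroup name and which minors you allow are illustrative — NVIDIA GPUs are char major 195, with minor = GPU index and 255 = /dev/nvidiactl):

```shell
# Create a throwaway cgroup under the devices controller.
mkdir /sys/fs/cgroup/devices/egltest

# Deny everything first, as ConstrainDevices-style setups do...
echo "a *:* rwm" > /sys/fs/cgroup/devices/egltest/devices.deny

# ...then whitelist only GPU 0 and the control device.
echo "c 195:0 rw"   > /sys/fs/cgroup/devices/egltest/devices.allow
echo "c 195:255 rw" > /sys/fs/cgroup/devices/egltest/devices.allow

# Move the current shell into the cgroup, then run the EGL test from it;
# opens on the other /dev/nvidia# nodes should now fail with EPERM.
echo $$ > /sys/fs/cgroup/devices/egltest/tasks
```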

[slurm-users] slurmctld segfaulting

2021-03-01 Thread Ransom, Geoffrey M.
Hello I have a ticket posted with schedmd, but this may be an issue the community has seen and may have a quick response. Slurmctld segfaulted (signal 11) on us and now segfaults on restart. I'm not aware of an obvious trigger for this behavior. We upgraded this cluster from 20.02.5 to 20.11

[slurm-users] Is there a cli_filter guide or tutorial?

2020-10-14 Thread Ransom, Geoffrey M.
Hello Is there a good document on creating a cli_filter somewhere? I want to write something to make sure users add a wckey matching a specific regexp to their jobs. We want metrics on what projects jobs were run on, but don't have reliable enough project information to trust managing them i

[slurm-users] Quickly throttling/limiting a specific user's jobs

2020-09-22 Thread Ransom, Geoffrey M.
Hello We had a user post a large number of array jobs with a short actual run time (20-80 seconds, but mostly to the low end) and slurmctld was falling behind on RPC calls trying to handle the jobs. It was a bit awkward trying to slap arraytaskthrottle=5 on each of the queued array jobs whil
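A sketch of applying the throttle to all of one user's queued array jobs in a single pass, rather than job by job (username and throttle value are illustrative; `%F` prints the array job id):

```shell
# List the user's pending array job ids, deduplicated, and throttle each.
for jobid in $(squeue -h -u someuser -t PD -o "%F" | sort -u); do
    scontrol update JobId="$jobid" ArrayTaskThrottle=5
done
```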

[slurm-users] How to throttle sinfo/squeue/scontrol show so they don't throttle slurmctld

2020-08-17 Thread Ransom, Geoffrey M.
Hello We are having performance issues with slurmctld (delayed sinfo/squeue results, socket timeouts for multiple sbatch calls, jobs/nodes sitting in COMP state for an extended period of time). We just fully switched to Slurm from Univa and I think our problem is users putting a lot of "sco

[slurm-users] TmpFS/tmpDisk/TMPDIR

2020-06-24 Thread Ransom, Geoffrey M.
Hello I defined "TmpDisk=93" for some machines in slurm 20.02.3 (and TmpFS is set to a local volume slightly bigger than that) and when I run... sbatch --tmp=10 -w node01 --array=1-100 --wrap="sleep 300" I ended up with 36 jobs on the machine at a time, 1 per CPU core. I expect the

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-10 Thread Ransom, Geoffrey M.
I'm just curious as to what causes a user to decide that a given node has an issue? If a node is healthy in all respects, why would a user decide not to use the node? Not enough free TMPDIR space, a GPU starts having memory errors, or a machine with a temporary issue that slurm hea

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Ransom, Geoffrey M.
is help? On Thu., Jun. 4, 2020 at 16:03, Ransom, Geoffrey M. (mailto:geoffrey.ran...@jhuapl.edu) wrote: Hello We are moving from Univa(sge) to slurm and one of our users has jobs that, if they detect a failure on the current machine, add that machine to their exclude list

[slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Ransom, Geoffrey M.
Hello We are moving from Univa(sge) to slurm, and one of our users has jobs that, if they detect a failure on the current machine, add that machine to their exclude list and requeue themselves. The user wants to emulate that behavior in slurm. It seems like "scontrol update job ${SLURM_JO
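A sketch of the requeue-and-exclude pattern from inside a batch script; `run_workload` is a stand-in for the user's real payload, and whether a running job may modify its own ExcNodeList can depend on Slurm version and permissions:

```shell
#!/bin/bash
# On failure, exclude the current node from this job and requeue it.
if ! run_workload; then
    scontrol update JobId="${SLURM_JOB_ID}" ExcNodeList="$(hostname -s)"
    scontrol requeue "${SLURM_JOB_ID}"
    exit 0
fi
```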

[slurm-users] gres:mps question

2020-01-09 Thread Ransom, Geoffrey M.
BLUF: Is the Nvidia MPS service required for the MPS gres to function in slurm with multiple GPUs in a single machine? (jobs using MPS don't need to span GPUs, just use a part of a GPU in a machine with multiple GPUs) Is there more detailed documentation available on how MPS should be
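A configuration sketch following the gres.conf MPS convention of one `mps` line per GPU; the share count of 100 per GPU and the two-GPU layout are illustrative:

```
# gres.conf fragment: Count is the number of MPS shares each GPU is
# divided into; jobs then request a slice with --gres=mps:N.
Name=mps Count=100 File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia1

# slurm.conf must also advertise the type:
GresTypes=gpu,mps
```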

Re: [slurm-users] Partition question

2019-12-19 Thread Ransom, Geoffrey M.
't let 100% of the compute resources get tied up with multi-week long jobs. Thanks. On 12/16/2019 2:29 PM, Ransom, Geoffrey M. wrote: Hello I am looking into switching from Univa (sge) to slurm and am figuring out how to implement some of our usage policy in slurm. We have a Univa queue wh

[slurm-users] Partition question

2019-12-16 Thread Ransom, Geoffrey M.
Hello I am looking into switching from Univa (sge) to slurm and am figuring out how to implement some of our usage policy in slurm. We have a Univa queue which uses job classes and RQSes to limit jobs with a run time over 4 hours to only half the available slots (CPU cores) so some slots ar
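One common way to express "long jobs may only use half the cores" in Slurm: two overlapping partitions over the same nodes, with the long partition bound to a QOS whose group CPU cap is half the cluster. Node names, time limits, and the 500-core cap below are illustrative for a 1000-core cluster:

```
# slurm.conf fragment
PartitionName=short Nodes=node[01-99] MaxTime=04:00:00 Default=YES
PartitionName=long  Nodes=node[01-99] MaxTime=14-00:00:00 QOS=long

# sacctmgr: cap aggregate CPU usage of the long partition's QOS:
#   sacctmgr add qos long set GrpTRES=cpu=500
```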

[slurm-users] Converting from Univa(sge) to slurm

2019-12-05 Thread Ransom, Geoffrey M.
Hello We are testing out slurm with the intent of replacing Univa (sge based) in our environment. I was wondering if there was a guide mapping Univa/sge concepts to slurm concepts that would assist in converting our Univa usage/schedule policy into a slurm setup? In particular, when using f