Hello
We have another batch of new users and some more batches of large array jobs
with very short runtimes, either because of errors in the jobs or just by design. While
trying to deal with these issues (setting ArrayTaskThrottle and educating users), I had a
thought that it would be very nice to have a limit
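For context, the throttle that exists today is per-job: it can be set at submit time with the %N suffix on --array, or applied after submission with scontrol. A sketch (the job ID and throttle value are placeholders):

```shell
# Limit a 1000-task array to 5 tasks running at once, at submit time:
$ sbatch --array=1-1000%5 job.sh

# Or apply/adjust the throttle on an already-submitted array job:
$ scontrol update JobId=12345 ArrayTaskThrottle=5
```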
Hello
We recently added new architectures to our compute and submit nodes and a
PATH gets generated based on the type of machine our users log into.
Unfortunately, this PATH is architecture dependent but it is getting copied
with slurm jobs to the compute nodes, which could be a different architecture.
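One workaround to consider (untested here, and it assumes your batch scripts re-run the login/profile scripts that build PATH) is submitting with a clean environment so PATH is rebuilt on the compute node for the local architecture:

```shell
# Do not propagate the submit node's environment into the job:
$ sbatch --export=NONE job.sh
```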
archiving/purging turned off so it
won't rearchive the data you restored.
-Paul Edmon-
On 12/10/2021 1:28 PM, Ransom, Geoffrey M. wrote:
Hello
Our slurmdbd database is getting rather large and affecting performance, but
we want to keep usage data around for a few years for metric purposes in order
to figure out how our users work. I read a suggestion to have a backup DB which
has all the usage data synced to it for metric purposes
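The usual pressure valve here is slurmdbd's own archive/purge settings, which write purged records to flat files that can be reloaded later if needed. A sketch of the relevant slurmdbd.conf knobs (values are examples only, not recommendations):

```
# slurmdbd.conf fragment
ArchiveDir=/var/spool/slurmdbd/archive
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=24months
PurgeStepAfter=6months
PurgeEventAfter=12months
```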
I just caught up on my slurm-users backlog and saw this was probably answered
12 hours before I sent it.
Thanks past slurm users. Sorry for the unnecessary emails.
From: slurm-users On Behalf Of Ransom,
Geoffrey M.
Sent: Wednesday, April 21, 2021 5:36 PM
To: Slurm User Community List
Hello
We tossed an A100 card set up as 7 MIG (Multi-Instance GPU) devices into
slurm, but the device files referring to each MIG instance are not immediately obvious. (16
/dev/nvidia-cap### devices were created.)
Is anyone familiar with this and know how to set up MIG devices as cgroup
controlled TRES in slurm?
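For what it's worth, newer Slurm releases (21.08 and later) can discover MIG instances themselves when slurmd is built against NVML, which avoids mapping the /dev/nvidia-cap### files by hand. A sketch, assuming NVML autodetection is available on the node (node name and count are placeholders):

```
# gres.conf on the GPU node
AutoDetect=nvml

# slurm.conf
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:7
```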
Hello
We are setting up a lua cli_filter to coerce users to set certain options
for their jobs (tracked but non-accounted wckeys) and noticed that the
cli_filter.lua script runs for both srun step and job allocations.
How should a cli_filter.lua script detect if it is inside a step or job
allocation?
From: slurm-users On Behalf Of Ransom,
Geoffrey M.
Sent: Thursday, March 4, 2021 4:20 PM
To: Slurm User Community List
Subject: [EXT] Alert-Verify-Sender: [slurm-users] slurm and device cgroups
Hello
I am trying to debug an issue with EGL support (updated NVIDIA drivers and
now EGLGetDisplay and EGLQueryDevicesExt are failing if they can't access all
/dev/nvidia# devices in slurm) and am wondering how slurm uses device cgroups
so I can implement the same cgroup setup by hand and test it.
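For experimenting outside of slurm, the cgroup v1 devices controller can reproduce the same kind of confinement by hand. A root-shell sketch (the group name "nvtest" is made up; 195 is the NVIDIA character-device major number, with /dev/nvidia0 at 195:0 and /dev/nvidiactl at 195:255):

```shell
# deny all devices, then allow only /dev/nvidia0 and /dev/nvidiactl
$ mkdir /sys/fs/cgroup/devices/nvtest
$ echo 'a *:* rwm'    > /sys/fs/cgroup/devices/nvtest/devices.deny
$ echo 'c 195:0 rw'   > /sys/fs/cgroup/devices/nvtest/devices.allow
$ echo 'c 195:255 rw' > /sys/fs/cgroup/devices/nvtest/devices.allow
$ echo $$             > /sys/fs/cgroup/devices/nvtest/tasks   # move this shell into the group
```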
Hello
I have a ticket posted with schedmd, but this may be an issue the community
has seen and may have a quick response.
Slurmctld segfaulted (signal 11) on us and now segfaults on restart. I'm not
aware of an obvious trigger for this behavior.
We upgraded this cluster from 20.02.5 to 20.11
Hello
Is there a good document on creating a cli_filter somewhere?
I want to write something to make sure users add a wckey matching a specific
regexp to their jobs. We want metrics on what projects jobs were run on, but
don't have reliable enough project information to trust managing them i
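A sketch of the wckey check itself, following the shape of the example cli_filter.lua shipped in the slurm source tree. Treat the option key name ("wckey") and the regexp as assumptions to verify against that example before relying on this:

```lua
-- Sketch only: reject submissions whose wckey does not look like proj_<number>.
-- Assumes the options table exposes long-option names, so options["wckey"].
function slurm_cli_pre_submit(options, pack_offset)
   local wckey = options["wckey"]
   if wckey == nil or string.match(wckey, "^proj_%d+$") == nil then
      slurm.log_error("wckey must match proj_<number>")
      return slurm.ERROR
   end
   return slurm.SUCCESS
end
```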
Hello
We had a user submit a large number of array jobs with a short actual run time
(20-80 seconds, mostly toward the low end) and slurmctld was falling behind on
RPC calls trying to handle the jobs. It was a bit awkward trying to slap
ArrayTaskThrottle=5 on each of the queued array jobs while
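The after-the-fact throttling can at least be scripted; a sketch (the user name and throttle value are placeholders):

```shell
# One line per pending array job (the un-expanded %A form), throttled in a loop:
$ for j in $(squeue -u someuser -h -t PD -o '%A' | sort -u); do
>     scontrol update JobId=$j ArrayTaskThrottle=5
> done
```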
Hello
We are having performance issues with slurmctld (delayed sinfo/squeue
results, socket timeouts for multiple sbatch calls, jobs/nodes sitting in COMP
state for an extended period of time).
We just fully switched to Slurm from Univa and I think our problem is users
putting a lot of "sco
Hello
I defined "TmpDisk=93" for some machines in slurm 20.02.3 (and TmpFS is
set to a local volume slightly bigger than that) and when I run...
sbatch --tmp=10 -w node01 --array=1-100 --wrap="sleep 300"
I ended up with 36 jobs on the machine at a time, 1 per CPU core. I expect the
I'm just curious as to what causes a user to decide that a given node has
an issue?
If a node is healthy in all respects, why would a user decide not to use
the node?
Not enough free TMPDIR space, a GPU starts having memory errors, or a machine
with a temporary issue that slurm hea
Does this help?
On Thu, Jun 4, 2020 at 16:03, Ransom, Geoffrey M.
(geoffrey.ran...@jhuapl.edu) wrote:
Hello
We are moving from Univa(sge) to slurm and one of our users has jobs that if
they detect a failure on the current machine they add that machine to their
exclude list and requeue themselves. The user wants to emulate that behavior in
slurm.
It seems like "scontrol update job ${SLURM_JOB_ID}
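That approach can be sketched as a self-requeueing batch script. Whether a running job is permitted to modify its own ExcNodeList can depend on site policy, so treat this as an untested sketch ("payload" is a stand-in for the real application):

```shell
#!/bin/bash
#SBATCH --requeue
./payload            # hypothetical application command
if [ $? -ne 0 ]; then
    # Add the current node to this job's exclude list, then requeue it:
    scontrol update JobId=$SLURM_JOB_ID ExcNodeList=$SLURMD_NODENAME
    scontrol requeue $SLURM_JOB_ID
fi
```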
BLUF:
Is the Nvidia MPS service required for the MPS gres to function in slurm
with multiple GPUs in a single machine? (jobs using MPS don't need to span
GPUs, just use a part of a GPU in a machine with multiple GPUs)
Is there more detailed documentation available on how MPS should be
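As I read the gres documentation, gres/mps shares are carved out of each GPU in gres.conf; a sketch for a two-GPU node (the share counts are examples, and this doesn't settle the MPS-daemon question above):

```
# gres.conf fragment -- 100 "mps shares" per GPU
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=mps Count=100 File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia1
```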
't let 100% of the compute resources get tied up with multi-week long jobs.
Thanks.
On 12/16/2019 2:29 PM, Ransom, Geoffrey M. wrote:
Hello
I am looking into switching from Univa (sge) to slurm and am figuring out
how to implement some of our usage policy in slurm.
We have a Univa queue which uses job classes and RQSes to limit jobs with a run
time over 4 hours to only half the available slots (CPU cores) so some slots
are kept free for shorter jobs.
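The closest slurm analogue I've found for that RQS pattern is two overlapping partitions on the same nodes, with a QOS cap on the long one; a sketch (names, the node list, and the CPU cap are all placeholders):

```
# slurm.conf fragment
PartitionName=short Nodes=node[01-10] MaxTime=04:00:00 Default=YES
PartitionName=long  Nodes=node[01-10] MaxTime=14-00:00:00 QOS=long

# then, with sacctmgr, cap the partition QOS at half the cores, e.g.:
#   sacctmgr add qos long set GrpTRES=cpu=200
```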
Hello
We are testing out slurm with the intent of replacing Univa (sge based) in
our environment. I was wondering if there was a guide mapping Univa/sge
concepts to slurm concepts that would assist in converting our Univa
usage/schedule policy into a slurm setup?
In particular, when using f