If you do scontrol -d show node it will show in more detail which resources
are actually being used:
[root@holy8a24507 general]# scontrol show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07
AvailableFeatures=amd,holyn
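With the -d (detail) flag, the node record also includes per-job allocation lines; the shape is roughly as below, though the exact field layout varies by Slurm version and the node/job IDs here are hypothetical:

```
[root@holy8a24507 general]# scontrol -d show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
   ...
   JobId=12345 CPU_IDs=0-15 Mem=64000 GRES=gpu:1(IDX:0)
   JobId=12346 CPU_IDs=16-31 Mem=64000 GRES=
```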
To me at least the simplest solution would be to create 3 partitions.
The first is for the CPU-only nodes, the second is for the GPU nodes, and
the third is a lower-priority requeue partition. This is how we do it
here. This way the requeue partition can be used to grab the CPUs on the
GPU nodes wi
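A minimal slurm.conf sketch of that three-partition layout (node ranges, partition names, and tier values are all illustrative, not from the original message):

```
# Assumes PreemptType=preempt/partition_prio is set globally.
PartitionName=cpu     Nodes=cpu[001-100]                PriorityTier=10 Default=YES State=UP
PartitionName=gpu     Nodes=gpu[001-020]                PriorityTier=10 State=UP
# Lower tier: jobs here can be preempted and requeued by the partitions above.
PartitionName=requeue Nodes=cpu[001-100],gpu[001-020]   PriorityTier=1  PreemptMode=REQUEUE State=UP
```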
Have you looked at this?
https://slurm.schedmd.com/slurm.conf.html#OPT_job_env Note that it will
eat up a ton of space in the database, so be warned.
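If I recall correctly, that option is turned on via AccountingStoreFlags in slurm.conf, roughly as below (assuming a Slurm version that supports these flags):

```
# slurm.conf sketch: store each job's submission environment (and script)
# in the accounting database. Both can inflate the database considerably.
AccountingStoreFlags=job_script,job_env
```

The stored data can then be pulled back out with sacct's --batch-script and --env-vars options, if memory serves.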
-Paul Edmon-
On 3/13/25 3:36 AM, Bhaskar Chakraborty via slurm-users wrote:
Hi everyone,
I have tried my best to extract custom job environme
You want: https://slurm.schedmd.com/scontrol.html#OPT_hostnames
-Paul Edmon-
On 1/6/2025 2:58 PM, Davide DelVento via slurm-users wrote:
Found it, I should have asked my puppet as it's mandatory in some
places :-D
It is simply
scontrol show hostname gpu[01-02],node[03-04,12-22,27-32,36]
On Tuesday, October 29, 2024, 7:43 PM, Paul Edmon via slurm-users
wrote:
If you are looking to use the C API for this then showq may be a good
guide: https://github.com/fasrc/slurm_showq The -o option orders the
pending queue in priority order.
If you are looking at native slurm commands, sprio can print out the
current priority breakdown of any job and filter by
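For example, the sprio invocations being described look roughly like this (job ID and user name are placeholders):

```
# Long-format priority factor breakdown for one pending job
sprio -l -j 12345
# Priority breakdown for all pending jobs of one user
sprio -u someuser
```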
You might need to do some tuning on your backfill loop, as that loop
should be the one that backfills in those lower priority jobs. I would
also look to see whether those lower priority jobs will actually fit in
prior to the higher priority job running; they may not.
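The backfill tuning mentioned above lives in SchedulerParameters in slurm.conf; a sketch with illustrative (not recommended) values:

```
# slurm.conf sketch: common backfill knobs.
# bf_window      - how far ahead (minutes) backfill plans
# bf_resolution  - planning granularity in seconds
# bf_max_job_test- how many pending jobs each backfill pass considers
# bf_continue    - let the backfill pass resume after releasing locks
SchedulerParameters=bf_continue,bf_window=2880,bf_resolution=300,bf_max_job_test=1000,bf_interval=30
```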
-Paul Edmon-
On 9/24/24 2:19 P
I think this might be the closest to one:
https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION
From the third paragraph:
"Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
and/or a simple node range expression may optionally be used to specify
numeric ranges
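Concretely, both forms from that paragraph would appear in slurm.conf along these lines (hardware values are placeholders):

```
# slurm.conf sketch: comma-separated names and a numeric range expression
NodeName=alpha,beta,gamma  CPUs=16 RealMemory=64000  State=UNKNOWN
NodeName=node[001-064]     CPUs=96 RealMemory=512000 State=UNKNOWN
```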
It's definitely working for 23.11.8, which is what we are using.
-Paul Edmon-
On 9/5/24 10:22 AM, Loris Bennett via slurm-users wrote:
Jason Simms via slurm-users writes:
Ours works fine, however, without the InteractiveStepOptions parameter.
My assumption is also that default value should b
'UNLIMITED','365-00:00:00').replace('Partition_Limit','365-00:00:00'))
Cheers,
Davide
On Tue, Aug 27, 2024 at 1:40 PM Paul Edmon via slurm-users
wrote:
This thread went in a bunch of different directions. However I ran with
Jeffrey's suggestion and
every 30 minutes. So long as the stats
are publicly visible anyway, put those summaries in a shared file system with
open read access. Name the files by uid number. Now your /etc/profile.d
script just cats ${STATS_DIR}/$(id -u).
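A minimal sketch of that profile.d approach. The STATS_DIR location and file contents are assumptions; some cron job elsewhere is expected to write the per-uid summary files:

```shell
#!/bin/sh
# Hypothetical /etc/profile.d/usage_stats.sh: print this user's usage
# summary at login if one exists. Never fails, so login is unaffected.
print_usage_stats() {
    # STATS_DIR defaults to an assumed shared, world-readable location.
    f="${STATS_DIR:-/shared/usage-stats}/$(id -u)"
    # Only print when the per-uid summary file is readable.
    [ -r "$f" ] && cat "$f"
    return 0
}
print_usage_stats
```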
On Aug 9, 2024, at 11:11, Paul Edmon via slurm-users
We use Reframe for our testing: https://github.com/fasrc/reframe-fasrc
-Paul Edmon-
On 8/26/2024 3:28 PM, Ole Holm Nielsen via slurm-users wrote:
On 26-08-2024 20:30, Paul Edmon via slurm-users wrote:
I haven't seen any behavior like that. For reference we are running
Rocky 8.9 with MOFED 23.10.2
-Paul Edmon-
On 8/26/2024 2:23 PM, Ole Holm Nielsen via slurm-users wrote:
Hi Paul,
On 26-08-2024 15:29, Paul Edmon via slurm-users wrote:
We've had this exact hardware for years now (all the CPU trays for
Lenovo have been dual trays for the past few generations though
previously they used a Y cable for connecting both). Basically the way
we handle it is to drain its partner node whenever one goes down for a
hardware issue.
That
We've been using Singularity for this for years without much issue. It
doesn't cover all use cases, but most applications work fine.
We have not implemented this yet:
https://slurm.schedmd.com/containers.html But I intend to investigate
it in the future. As of right now we just have the late
FAQ worthy? Definitely for my own Slurm FAQ. Others will decide
if it is worthy for the Slurm docs :)
Thanks everyone for your help!
Jeff
On Mon, Aug 12, 2024 at 9:36 AM Paul Edmon via slurm-users
wrote:
Normally MPI will just pick up the host list from Slurm itself.
You just need to build MPI against Slurm and it will just grab it.
Typica
M Hermann Schwärzler via slurm-users
wrote:
Hi Paul,
On 8/9/24 18:45, Paul Edmon via slurm-users wrote:
As I recall I think OpenMPI needs a list that has an entry on each line,
rather than one separated by a space. See:
[root@holy7c26401 ~]# echo $SLURM_JOB_NODELIST
holy7c[26401-26405]
[root@holy7c26401 ~]# scontrol show hostnames $SLURM_JOB_NODELIST
holy7c26401
holy7c26402
holy7c26403
holy7c26404
the now-shuttered XSEDE program, and is useful for both system and user
monitoring.
-- A.
On Fri, Aug 09, 2024 at 12:12:08PM -0400, Paul Edmon via slurm-users wrote:
Yeah, I was contemplating doing that so I didn't have a dependency on the
scheduler being up or down or busy.
On Aug 9, 2024, at 11:11, Paul Edmon via slurm-users
wrote:
We are working to make our users more aware of their usage. One of the
ideas we came up with was to have some basic usage stats printed at
login (usage over the past day, fairshare, job efficiency, etc). Does anyone
have any scripts or methods that they use to do this? Before baking my
own I was
I think this would be a good feature request. At least to me everything
you can get in scontrol show job should be in sacct in some form.
-Paul Edmon-
On 8/7/2024 9:29 AM, Steffen Grunewald wrote:
On Wed, 2024-08-07 at 08:55:21 -0400, Slurm users wrote:
Warning on that one, it can eat up a ton of database space (depending on
size of environment, uniqueness of environment between jobs, and number
of jobs). We had it on and it nearly ran us out of space on our database
host. That said the data can be really useful depending on the situation.
-P
That looks to be the case from my glance at sacct. Not everything in
scontrol show job ends up in sacct, which is a bit frustrating at times.
-Paul Edmon-
On 8/7/2024 8:08 AM, Steffen Grunewald via slurm-users wrote:
Hello everyone,
I've grepped the manual pages and crawled the 'net, but coul
We do this by adding groups/users to /etc/security/access.conf. That
should grant normal ssh access, assuming you still have pam_access.so
in your sshd config. Note that if the user has a job on the node,
slurm will still shunt them into that job even with the access.conf
setting. So when
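The access.conf entries being described might look like this (group names are hypothetical; order matters, since pam_access stops at the first match, and a trailing deny-all like this must not lock out accounts you still need):

```
# /etc/security/access.conf sketch
# allow members of the cluster_admins and cluster_users groups from anywhere
+ : (cluster_admins) : ALL
+ : (cluster_users)  : ALL
# deny everyone else
- : ALL : ALL
```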
https://slurm.schedmd.com/upgrades.html#compatibility_window
Looks like no. You have to be within 2 major releases.
-Paul Edmon-
On 6/17/24 5:40 AM, ivgeokig via slurm-users wrote:
Hello!
I have a question. I have the server 19.05.3. No chance to upgrade
it. Have I any chance to conn
There is no way to do it in slurm. You have to do it in the mail program
you are using to send mail. In our case we use postfix and we set
smtp_generic_maps to accomplish this.
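A sketch of the Postfix generic-maps rewrite being described (hostnames and addresses are placeholders; after editing the map you would run postmap on it):

```
# /etc/postfix/main.cf
smtp_generic_maps = hash:/etc/postfix/generic

# /etc/postfix/generic  (then: postmap /etc/postfix/generic)
# rewrite the node-local sender into something deliverable/replyable
slurm@node001.cluster.local    hpc-notifications@example.org
```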
-Paul Edmon-
On 6/7/2024 3:33 PM, Vanhorn, Mike via slurm-users wrote:
All,
When the slurm daemon is sending out e
Many parameters in slurm can be changed via scontrol and sacctmgr
commands without updating the conf itself. The thing is that scontrol
changes are not durable across restarts. sacctmgr changes, though, update
the slurmdb and thus will be sticky.
That's at least what I would do if you are using
A friend asked me to pass this along. Figured some folks on this list
might be interested.
https://broadinstitute.avature.net/en_US/careers/JobDetail/HPC-Principal-System-Engineer/17773
-Paul Edmon-
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slu
Usually to clear jobs like this you have to reboot the node they are on.
That will then force the scheduler to clear them.
-Paul Edmon-
On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote:
We are running a slurm cluster with version `slurm 22.05.8`. One of our users
has reported t
I wrote a little blog post on this topic a few years back:
https://www.rc.fas.harvard.edu/blog/cluster-fragmentation/
It's a vexing problem, but as noted by the other responders it is
something that depends on your cluster policy and job performance needs.
Well written MPI code should be able
For this use case you probably want to go with Classic Fairshare
(https://slurm.schedmd.com/classic_fair_share.html) rather than
FairTree. Classic Fairshare behaves in a way similar to what you
describe. You can set up different bins for fairshare and then the user
can pull from them. So that w
Just wanted to share some slurm utilities that we've written at Harvard
FASRC that may be useful to the community.
seff-account: https://github.com/fasrc/seff-account Creates job
statistics summaries for users and accounts similar to what seff and
seff-array does.
showq: https://github.com/f
ms you MUST use srun
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Wed, 28 Feb 2024 10:25am, Paul Edmon via slurm-users wrote:
salloc is the currently recommended way for interactive sessions. srun
is now intended for launching steps or MPI applications. So properly you
would salloc and then srun inside the salloc.
As you've noticed, with srun you tend to lose control of your shell as it
takes over, so you have background
I concur with what folks have written so far; it really depends on your
use case. For instance, if you are looking at a cluster with GPUs and
intend to do some serious computing there, you are going to need RDMA of
some sort. But it all depends on what you end up needing for your workflows.
For
Are you using the job_script storage option? If so then you should be
able to get at it by doing:
sacct -B -j JOBID
https://slurm.schedmd.com/sacct.html#OPT_batch-script
-Paul Edmon-
On 2/16/2024 2:41 PM, Jason Simms via slurm-users wrote:
Hello all,
I've used the "scontrol write batch_scrip
You probably want the Prolog option:
https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with:
https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail
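If I recall correctly, ForceRequeueOnFail is set as a PrologFlags value alongside the Prolog script itself; a slurm.conf sketch (script path is a placeholder, and you should confirm the flag exists in your Slurm version):

```
# slurm.conf sketch: run a prolog on every node at job start, and
# requeue the job automatically if the prolog fails.
Prolog=/etc/slurm/prolog.sh
PrologFlags=ForceRequeueOnFail
```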
-Paul Edmon-
On 2/14/2024 8:38 AM, Cutts, Tim via slurm-users wrote:
Hi, I apologise if I’ve failed to find this in the docum