[slurm-users] Re: Best Way to See GPUs in Use?

2025-04-05 Thread Paul Edmon via slurm-users
If you do scontrol -d show node it will show what resources are actually being used, in more detail: [root@holy8a24507 general]# scontrol show node holygpu8a11101 NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48    CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07 AvailableFeatures=amd,holyn

[slurm-users] Re: cpus and gpus partitions and how to optimize the resource usage

2025-03-31 Thread Paul Edmon via slurm-users
To me at least the simplest solution would be to create 3 partitions: the first for the CPU-only nodes, the second for the GPU nodes, and the third a lower-priority requeue partition. This is how we do it here. This way the requeue partition can be used to grab the CPUs on the GPU nodes wi

[slurm-users] Re: Job Env Vars in Slurm Core

2025-03-13 Thread Paul Edmon via slurm-users
Have you looked at this? https://slurm.schedmd.com/slurm.conf.html#OPT_job_env  Note that it will eat up a ton of space in the database, so be warned. -Paul Edmon- On 3/13/25 3:36 AM, Bhaskar Chakraborty via slurm-users wrote: Hi everyone, I have tried my best to extract custom job environme
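The linked job_env option is a value of AccountingStoreFlags in slurm.conf (assuming a Slurm release recent enough to support it); a minimal illustrative fragment, not a tested configuration:

```
# slurm.conf -- store each job's environment (and, optionally, its batch
# script) in the accounting database. As noted above, both can consume
# substantial database space on a busy cluster.
AccountingStoreFlags=job_env,job_script
```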

[slurm-users] Re: formatting node names

2025-01-06 Thread Paul Edmon via slurm-users
You want: https://slurm.schedmd.com/scontrol.html#OPT_hostnames -Paul Edmon- On 1/6/2025 2:58 PM, Davide DelVento via slurm-users wrote: Found it, I should have asked to my puppet as it's mandatory in some places :-D It is simply scontrol show hostname gpu[01-02],node[03-04,12-22,27-32,36] S
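scontrol show hostname does this expansion natively; purely as an illustration of what the expansion means, here is a minimal Python sketch. It handles only the simple prefix[ranges] form shown in the thread, nothing like the full slurm_hostlist grammar:

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm hostlist like 'node[03-04,12]' or 'gpu01'.

    Only handles a single 'prefix[a-b,c,...]' group per name; the real
    scontrol code covers far more cases.
    """
    hosts = []
    # Split on commas that are not inside brackets.
    for part in re.split(r',(?![^\[]*\])', expr):
        m = re.fullmatch(r'([^\[]+)\[([^\]]+)\]', part)
        if not m:
            hosts.append(part)
            continue
        prefix, ranges = m.groups()
        for rng in ranges.split(','):
            if '-' in rng:
                lo, hi = rng.split('-')
                width = len(lo)  # preserve zero-padding, e.g. '01'
                hosts.extend(f"{prefix}{i:0{width}d}"
                             for i in range(int(lo), int(hi) + 1))
            else:
                hosts.append(prefix + rng)
    return hosts

print(expand_hostlist("gpu[01-02],node[03-04,12]"))
# -> ['gpu01', 'gpu02', 'node03', 'node04', 'node12']
```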

[slurm-users] Re: Slurm Job Sched Priority

2024-11-03 Thread Paul Edmon via slurm-users
On Tuesday, October 29, 2024, 7:43 PM, Paul Edmon via slurm-users wrote: If you are looking to use the C API for this then s

[slurm-users] Re: Slurm Job Sched Priority

2024-10-29 Thread Paul Edmon via slurm-users
If you are looking to use the C API for this then showq may be a good guide: https://github.com/fasrc/slurm_showq  The -o option orders the pending queue in priority order. If you are looking at native slurm commands, sprio can print out the current priority breakdown of any job and filter by

[slurm-users] Re: Jobs pending with reason "priority" but nodes are idle

2024-09-24 Thread Paul Edmon via slurm-users
You might need to do some tuning on your backfill loop, as that loop should be the one that backfills in those lower-priority jobs.  I would also check whether those lower-priority jobs will actually fit before the higher-priority job runs; they may not. -Paul Edmon- On 9/24/24 2:19 P

[slurm-users] Re: Nodelist syntax and semantics

2024-09-05 Thread Paul Edmon via slurm-users
I think this might be the closest to one: https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION From the third paragraph: "Multiple node names may be comma separated (e.g. "alpha,beta,gamma") and/or a simple node range expression may optionally be used to specify numeric ranges

[slurm-users] Re: salloc not starting shell despite LaunchParameters=use_interactive_step

2024-09-05 Thread Paul Edmon via slurm-users
It's definitely working for 23.11.8, which is what we are using. -Paul Edmon- On 9/5/24 10:22 AM, Loris Bennett via slurm-users wrote: Jason Simms via slurm-users writes: Ours works fine, however, without the InteractiveStepOptions parameter. My assumption is also that default value should b

[slurm-users] Re: Print Slurm Stats on Login

2024-08-29 Thread Paul Edmon via slurm-users
'UNLIMITED','365-00:00:00').replace('Partition_Limit','365-00:00:00')) Cheers, Davide On Tue, Aug 27, 2024 at 1:40 PM Paul Edmon via slurm-users wrote: This thread went a bunch of different directions. However I ran with Jeffrey's suggestion and

[slurm-users] Re: Print Slurm Stats on Login

2024-08-27 Thread Paul Edmon via slurm-users
very 30 minutes. So long as the stats are publicly visible anyway, put those summaries in a shared file system with open read access. Name the files by uid number. Now your /etc/profile.d script just cats ${STATS_DIR}/$(id -u). On Aug 9, 2024, at 11:11, Paul Edmon via slurm-users

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Paul Edmon via slurm-users
e use Reframe for our testing: https://github.com/fasrc/reframe-fasrc). -Paul Edmon- On 8/26/2024 3:28 PM, Ole Holm Nielsen via slurm-users wrote: On 26-08-2024 20:30, Paul Edmon via slurm-users wrote: I haven't seen any behavior like that. For reference we are running Rocky 8.9 with MOFED 23.10.

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Paul Edmon via slurm-users
I haven't seen any behavior like that.  For reference we are running Rocky 8.9 with MOFED 23.10.2 -Paul Edmon- On 8/26/2024 2:23 PM, Ole Holm Nielsen via slurm-users wrote: Hi Paul, On 26-08-2024 15:29, Paul Edmon via slurm-users wrote: We've had this exact hardware for years no

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Paul Edmon via slurm-users
We've had this exact hardware for years now (all the CPU trays for Lenovo have been dual trays for the past few generations though previously they used a Y cable for connecting both). Basically the way we handle it is to drain its partner node whenever one goes down for a hardware issue. That

[slurm-users] Re: How to select a container runtime system?

2024-08-23 Thread Paul Edmon via slurm-users
We've been using Singularity for this for years without much issue. It doesn't cover all use cases, but most applications work fine. We have not implemented this yet: https://slurm.schedmd.com/containers.html  But I intend to investigate it in the future. As of right now we just have the late

[slurm-users] Re: Annoying canonical question about converting SLURM_JOB_NODELIST to a host list for mpirun

2024-08-12 Thread Paul Edmon via slurm-users
AQ worthy? Definitely for my own Slurm FAQ. Others will decide if it is worthy for Slurm docs :) Thanks everyone for your help! Jeff On Mon, Aug 12, 2024 at 9:36 AM Paul Edmon via slurm-users wrote: Normally MPI will just pick up the host list from Slurm itsel

[slurm-users] Re: Annoying canonical question about converting SLURM_JOB_NODELIST to a host list for mpirun

2024-08-12 Thread Paul Edmon via slurm-users
ide if it is worthy for Slurm docs :) Thanks everyone for your help! Jeff On Mon, Aug 12, 2024 at 9:36 AM Paul Edmon via slurm-users wrote: Normally MPI will just pick up the host list from Slurm itself. You just need to build MPI against Slurm and it will just grab it. Typica

[slurm-users] Re: Annoying canonical question about converting SLURM_JOB_NODELIST to a host list for mpirun

2024-08-12 Thread Paul Edmon via slurm-users
M Hermann Schwärzler via slurm-users wrote: Hi Paul, On 8/9/24 18:45, Paul Edmon via slurm-users wrote: > As I recall I think OpenMPI needs a list that has an entry on each line, > rather than one seperated by a space. See: > > [root@holy7c26401 ~]# echo

[slurm-users] Re: Annoying canonical question about converting SLURM_JOB_NODELIST to a host list for mpirun

2024-08-09 Thread Paul Edmon via slurm-users
As I recall, OpenMPI needs a list with one entry per line, rather than one separated by spaces. See: [root@holy7c26401 ~]# echo $SLURM_JOB_NODELIST holy7c[26401-26405] [root@holy7c26401 ~]# scontrol show hostnames $SLURM_JOB_NODELIST holy7c26401 holy7c26402 holy7c26403 holy7c26404
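scontrol show hostnames already prints one host per line, so redirecting it to a file gives a usable OpenMPI hostfile directly. If all you have is the space-separated form, tr does the conversion; a small sketch with a hypothetical host list in place of a live cluster:

```shell
# Hypothetical host list; on a real cluster you would instead run:
#   scontrol show hostnames "$SLURM_JOB_NODELIST" > hostfile.txt
hosts="holy7c26401 holy7c26402 holy7c26403"

# Convert space-separated names to one-per-line for mpirun --hostfile.
echo "$hosts" | tr ' ' '\n' > hostfile.txt
cat hostfile.txt
```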

[slurm-users] Re: Print Slurm Stats on Login

2024-08-09 Thread Paul Edmon via slurm-users
e now-shuttered XSEDE program, and is useful for both system and user monitoring. -- A. On Fri, Aug 09, 2024 at 12:12:08PM -0400, Paul Edmon via slurm-users wrote: Yeah, I was contemplating doing that so I didn't have a dependency on the scheduler being up or down or busy.

[slurm-users] Re: Print Slurm Stats on Login

2024-08-09 Thread Paul Edmon via slurm-users
cess. Name the files by uid number. Now your /etc/profile.d script just cats ${STATS_DIR}/$(id -u). On Aug 9, 2024, at 11:11, Paul Edmon via slurm-users wrote: We are working to make our users more aware of their usage. One of the ideas we came up with was to have some basic usage st
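The per-uid-file scheme described above can be sketched in a few lines. This is an illustration only: STATS_DIR here is a temp directory standing in for the real shared filesystem, the stats text is made up, and the cron job that would regenerate the files every 30 minutes is not shown:

```shell
# Stand-in for a world-readable shared directory of per-uid stats files.
STATS_DIR=$(mktemp -d)

# A cron job (not shown) would periodically write each user's summary
# to a file named after their uid.
echo "usage last 24h: 120 CPU-hours; fairshare: 0.43" > "$STATS_DIR/$(id -u)"

# /etc/profile.d/slurm-stats.sh would then be roughly this one line:
[ -r "$STATS_DIR/$(id -u)" ] && cat "$STATS_DIR/$(id -u)"
```

The appeal of this design, as noted in the thread, is that login-time display has no dependency on the scheduler or database being up or responsive.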

[slurm-users] Print Slurm Stats on Login

2024-08-09 Thread Paul Edmon via slurm-users
We are working to make our users more aware of their usage. One of the ideas we came up with was to have some basic usage stats printed at login (usage over the past day, fairshare, job efficiency, etc). Does anyone have any scripts or methods that they use to do this? Before baking my own I was

[slurm-users] Re: Find out submit host of past job?

2024-08-07 Thread Paul Edmon via slurm-users
I think this would be a good feature request. At least to me everything you can get in scontrol show job should be in sacct in some form. -Paul Edmon- On 8/7/2024 9:29 AM, Steffen Grunewald wrote: On Wed, 2024-08-07 at 08:55:21 -0400, Slurm users wrote: Warning on that one, it can eat up a to

[slurm-users] Re: Find out submit host of past job?

2024-08-07 Thread Paul Edmon via slurm-users
Warning on that one, it can eat up a ton of database space (depending on size of environment, uniqueness of environment between jobs, and number of jobs). We had it on and it nearly ran us out of space on our database host. That said the data can be really useful depending on the situation. -P

[slurm-users] Re: Find out submit host of past job?

2024-08-07 Thread Paul Edmon via slurm-users
That looks to be the case from my glance at sacct. Not everything in scontrol show job ends up in sacct, which is a bit frustrating at times. -Paul Edmon- On 8/7/2024 8:08 AM, Steffen Grunewald via slurm-users wrote: Hello everyone, I've grepped the manual pages and crawled the 'net, but coul

[slurm-users] Re: Temporarily bypassing pam_slurm_adopt.so

2024-07-09 Thread Paul Edmon via slurm-users
We do this by adding groups/users to /etc/security/access.conf That should grant normal ssh access assuming you still have pam_access.so in your sshd config.  Note that if the user has a job on the node, slurm will still shunt them into that job even with the access.conf setting.  So when
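A minimal sketch of what such an access.conf might look like (the group and user names are placeholders, not from the original thread):

```
# /etc/security/access.conf -- evaluated top-down by pam_access.so:
# allow members of group hpc-admins and user jharvard, deny everyone else.
+ : (hpc-admins) : ALL
+ : jharvard : ALL
- : ALL : ALL
```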

[slurm-users] Re: Unsupported RPC version by slurmctld 19.05.3 from client slurmd 22.05.11

2024-06-17 Thread Paul Edmon via slurm-users
https://slurm.schedmd.com/upgrades.html#compatibility_window Looks like no. You have to be within two major releases. -Paul Edmon- On 6/17/24 5:40 AM, ivgeokig via slurm-users wrote: Hello!     I have a question. I have the server 19.05.3. No chance to upgrade it.   Have I any chance to conn

[slurm-users] Re: need to set From: address for slurm

2024-06-07 Thread Paul Edmon via slurm-users
There is no way to do it in slurm. You have to do it in the mail program you are using to send mail. In our case we use postfix and we set smtp_generic_maps to accomplish this. -Paul Edmon- On 6/7/2024 3:33 PM, Vanhorn, Mike via slurm-users wrote: All, When the slurm daemon is sending out e
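A rough sketch of the postfix approach mentioned above, with placeholder host and address names:

```
# /etc/postfix/main.cf -- rewrite sender addresses on outgoing SMTP
# using a lookup table:
smtp_generic_maps = hash:/etc/postfix/generic

# /etc/postfix/generic -- map the node-local From: address that slurmctld
# produces to the address you want users to see:
root@node01.cluster.example.com    slurm@cluster.example.com
```

After editing the map, run postmap /etc/postfix/generic and reload postfix so the hashed table is rebuilt and picked up.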

[slurm-users] Re: dynamical configuration || meta configuration mgmt

2024-05-29 Thread Paul Edmon via slurm-users
Many parameters in slurm can be changed via scontrol and sacctmgr commands without updating the conf itself. The catch is that scontrol changes are not durable across restarts; sacctmgr, though, updates the slurmdb and thus will be sticky. That's at least what I would do if you are using

[slurm-users] HPC Principal System Engineer at the Broad

2024-04-25 Thread Paul Edmon via slurm-users
A friend asked me to pass this along. Figured some folks on this list might be interested. https://broadinstitute.avature.net/en_US/careers/JobDetail/HPC-Principal-System-Engineer/17773 -Paul Edmon- -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slu

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Paul Edmon via slurm-users
Usually to clear jobs like this you have to reboot the node they are on. That will then force the scheduler to clear them. -Paul Edmon- On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote: We are running a slurm cluster with version `slurm 22.05.8`. One of our users has reported t

[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Paul Edmon via slurm-users
I wrote a little blog post on this topic a few years back: https://www.rc.fas.harvard.edu/blog/cluster-fragmentation/ It's a vexing problem, but as noted by the other responders it is something that depends on your cluster policy and job performance needs. Well written MPI code should be able

[slurm-users] Re: FairShare priority questions

2024-03-27 Thread Paul Edmon via slurm-users
For this use case you probably want to go with Classic Fairshare (https://slurm.schedmd.com/classic_fair_share.html) rather than FairTree. Classic Fairshare behaves in a way similar to what you describe. You can set up different bins for fairshare and then the user can pull from them. So that w

[slurm-users] Slurm Utilities

2024-03-13 Thread Paul Edmon via slurm-users
Just wanted to share some slurm utilities that we've written at Harvard FASRC that may be useful to the community. seff-account: https://github.com/fasrc/seff-account  Creates job statistics summaries for users and accounts similar to what seff and seff-array do. showq: https://github.com/f

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
ms you MUST use srun -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Wed, 28 Feb 2024 10:25am, Paul Edmon via slurm-users wrote:   External Email - Use Caution salloc is the currently recommended way for interactive sessions. srun is now intended for launching steps or MPI applicatio

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
salloc is the currently recommended way for interactive sessions. srun is now intended for launching steps or MPI applications. So properly you would salloc and then srun inside the salloc. As you've noticed, with srun you tend to lose control of your shell as it takes over, so you have background

[slurm-users] Re: Question about IB and Ethernet networks

2024-02-26 Thread Paul Edmon via slurm-users
I concur with what folks have written so far; it really depends on your use case. For instance if you are looking at a cluster with GPUs and intend to do some serious computing there, you are going to need RDMA of some sort. But it all depends on what you end up needing for your workflows. For

[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Paul Edmon via slurm-users
Are you using the job_script storage option? If so then you should be able to get at it by doing: sacct -B -j JOBID https://slurm.schedmd.com/sacct.html#OPT_batch-script -Paul Edmon- On 2/16/2024 2:41 PM, Jason Simms via slurm-users wrote: Hello all, I've used the "scontrol write batch_scrip
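For sacct -B to return anything, the scripts have to be stored in the first place; a minimal slurm.conf fragment for that (assuming a Slurm release with AccountingStoreFlags support):

```
# slurm.conf -- store submitted batch scripts in the accounting database
# so `sacct -B -j <jobid>` can retrieve them later.
AccountingStoreFlags=job_script
```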

[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Paul Edmon via slurm-users
You probably want the Prolog option: https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with: https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail -Paul Edmon- On 2/14/2024 8:38 AM, Cutts, Tim via slurm-users wrote: Hi, I apologise if I’ve failed to find this in the docum