[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-03 Thread Renfro, Michael via slurm-users
is small enough to where we’ve never had a load or other performance issue with our AD. From: Steven Jones Date: Monday, February 3, 2025 at 2:14 PM To: Renfro, Michael , slurm-users@lists.schedmd.com , Chris Samuel Subject: Re: [slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld External Email

[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-03 Thread Renfro, Michael via slurm-users
Late to the party here, but depending on how much time you have invested, how much you can tolerate reformats or other more destructive work, etc., you might consider OpenHPC and its install guide ([1] for RHEL 8 variants, [2] or [3] for RHEL 9 variants, depending on which version of Warewulf yo

[slurm-users] Re: How can I make sure my user have only one job per node (Job array --exclusive=user,)

2024-12-03 Thread Renfro, Michael via slurm-users
that here without some astoundingly good reasons. From: Oren Date: Tuesday, December 3, 2024 at 3:36 PM To: Renfro, Michael Cc: slurm-us...@schedmd.com Subject: Re: [slurm-users] How can I make sure my user have only one job per node (Job array --exclusive=user,)

[slurm-users] Re: How can I make sure my user have only one job per node (Job array --exclusive=user,)

2024-12-03 Thread Renfro, Michael via slurm-users
at 3:15 PM To: Renfro, Michael Cc: slurm-us...@schedmd.com Subject: Re: [slurm-users] How can I make sure my user have only one job per node (Job array --exclusive=user,)

[slurm-users] Re: How can I make sure my user have only one job per node (Job array --exclusive=user,)

2024-12-03 Thread Renfro, Michael via slurm-users
I’ll start with the question of “why spread the jobs out more than required?” and move on to why the other items didn’t work: 1. exclusive only ensures that others’ jobs don’t run on a node with your jobs, and does nothing about other jobs you own. 2. spread-job distributes the work of on

[slurm-users] Re: How does --nodes=min[-max] determine number of nodes to allocate?

2024-10-08 Thread Renfro, Michael via slurm-users
Not so much about the source, but in the sbatch documentation [1], I think the --begin and --nodes parameters might interact. And yes, this is semi-educated speculation on my part. From the nodes= section, “The job will be allocated as many nodes as possible within the range specified and witho

[slurm-users] Re: Jobs pending with reason "priority" but nodes are idle

2024-09-25 Thread Renfro, Michael via slurm-users
riority jobs onto those nodes, since those jobs don’t require GPUs. It’s very unexpected behavior, to me. Is there an option somewhere I need to set? From: "Renfro, Michael" <mailto:ren...@tntech.edu> Date: Tuesday, September 24, 2024 at 1:54 PM To: Daniel Long <mailto:daniel

[slurm-users] Re: Jobs pending with reason "priority" but nodes are idle

2024-09-24 Thread Renfro, Michael via slurm-users
, they have a time request that conflicts with the scheduled start time for the high priority jobs. [1] https://slurm.schedmd.com/sched_config.html#backfill From: Long, Daniel S. Date: Tuesday, September 24, 2024 at 1:20 PM To: Renfro, Michael , slurm-us...@schedmd.com Subject: Re: Jobs pending

[slurm-users] Re: Jobs pending with reason "priority" but nodes are idle

2024-09-24 Thread Renfro, Michael via slurm-users
In theory, if jobs are pending with “Priority”, one or more other jobs will be pending with “Resources”. So a few questions: 1. What are the “Resources” jobs waiting on, resource-wise? 2. When are they scheduled to start? 3. Can your array jobs backfill into the idle resources and fini

[slurm-users] Re: FairShare if there's only one account?

2024-08-09 Thread Renfro, Michael via slurm-users
bsmith’s jobs should start earlier than csmith’s. From: Drucker, Daniel Date: Friday, August 9, 2024 at 3:11 PM To: Renfro, Michael Cc: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] FairShare if there's only one account?

[slurm-users] Re: FairShare if there's only one account?

2024-08-09 Thread Renfro, Michael via slurm-users
I don’t have any 21.08 systems to verify with, but that’s how I remember it. Use “sshare -a -A mic” to verify. You should see both a RawShares and a NormShares column for each user. By default they’ll all have the same value, but they can be adjusted if needed. From: Drucker, Daniel via slurm-u

[slurm-users] Re: The issue in the distribution of job

2024-08-09 Thread Renfro, Michael via slurm-users
It may be difficult to narrow down the problem without knowing what commands you're running inside the salloc session. For example, if it's a pure OpenMP program, it can't use more than one node. From: Sundaram Kumaran via slurm-users Sent: Friday, August 9, 2024

[slurm-users] Re: Software builds using slurm

2024-06-10 Thread Renfro, Michael via slurm-users
At a certain point, you’re talking about workflow orchestration. Snakemake [1] and its slurm executor plugin [2] may be a starting point, especially since Snakemake is a local-by-default tool. I wouldn’t try reproducing your entire “make” workflow in Snakemake. Instead, I’d define the roughly 60

[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Renfro, Michael via slurm-users
looser rules, but not the core main/contrib/non-free repositories. From: Renfro, Michael Date: Wednesday, May 15, 2024 at 10:19 AM To: Jeffrey Layton , Lloyd Brown Cc: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Re: Location of Slurm source packages? Debian/Ubuntu sources can always

[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Renfro, Michael via slurm-users
Debian/Ubuntu sources can always be found in at least two ways: 1. Pages like https://packages.ubuntu.com/jammy/slurm-wlm (see the .dsc, .orig.tar.gz, and .debian.tar.xz links there). 2. Commands like ‘apt-get source slurm-wlm’ (may require ‘dpkg-dev’ or other packages – probably easiest

[slurm-users] Re: [EXT] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU From: Alison Peterson Date: Thursday, April 4, 2024 at 11:58 AM To: Renfro, Michael Subject: Re: [EXT] Re: [slurm-users] SLURM configuration help

[slurm-users] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
What does “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” show? On one job we currently have that’s pending due to Resources, that job has requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the node it wants to run on only has 37 CPUs available (seen by

[slurm-users] Re: SLURM configuration for LDAP users

2024-02-04 Thread Renfro, Michael via slurm-users
“An LDAP user can login to the login, slurmctld and compute nodes, but when they try to submit jobs, slurmctld logs an error about invalid account or partition for user.” Since I don’t think it was mentioned below, does a non-LDAP user get the same error, or does it work by default? We don’t u

Re: [slurm-users] Slurp for sw builds

2024-01-03 Thread Renfro, Michael
You can attack this in a few different stages. A lot of what you’re interested in will be found at various university or national lab sites (I Googled “sbatch example” for the one below) 1. If you’re good with doing a “make -j” to parallelize a make compilation over multiple CPUs in a singl
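The single-node parallel build described in step 1 can be sketched as a minimal batch script. This is a hedged example with hypothetical resource numbers; `SLURM_CPUS_PER_TASK` is the standard variable Slurm sets from `--cpus-per-task`:

```shell
#!/bin/bash
# Hypothetical sketch: one multi-threaded "make -j" inside a
# single node's allocation.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

# Use however many CPUs Slurm actually granted (fall back to 1
# if run outside a job).
make -j"${SLURM_CPUS_PER_TASK:-1}"
```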

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Renfro, Michael
Is this Northwestern’s Quest HPC or another one? I know at least a few of the people involved with Quest, and I wouldn’t have thought they’d be in dire need of coaching. And to follow on with Davide’s point, this really sounds like a case for submitting multiple jobs with dependencies between t

Re: [slurm-users] Guidance on which HPC to try our "OpenHPC or TrintyX " for novice

2023-10-03 Thread Renfro, Michael
I’d probably default to OpenHPC just for the community around it, but I’ll also note that TrinityX might not have had any commits in their GitHub for an 18-month period (unless I’m reading something wrong). On Oct 3, 2023, at 5:51 AM, John Joseph wrote:

Re: [slurm-users] extended list of nodes allocated to a job

2023-08-17 Thread Renfro, Michael
Given a job ID: scontrol show hostnames $(scontrol show job some_job_id | grep ' NodeList=' | cut -d= -f2) | paste -sd, Maybe there’s something more built-in than this, but it gets the job done. From: slurm-users on behalf of Alain O' Miniussi Date: Thursday, August 17, 2023 at 7:46 AM To: S
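The final `paste` step of that one-liner can be tried without a cluster; here a `printf` stands in for the `scontrol show hostnames …` output:

```shell
# Collapse a newline-separated host list into a comma-separated
# one, as the quoted pipeline does with scontrol's output.
printf 'node001\nnode002\nnode003\n' | paste -sd, -
# -> node001,node002,node003
```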

Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Renfro, Michael
If there’s a fairshare component to job priorities, and there’s a share assigned to each user under the account, wouldn’t the light user’s jobs move ahead of any of the heavy user’s pending jobs automatically? From: slurm-users on behalf of "Groner, Rob" Reply-To: Slurm User Community List D

Re: [slurm-users] Allow regular users to make reservations

2022-08-08 Thread Renfro, Michael
Going in a completely different direction than you’d planned, but for the same goal, what about making a script (shell, Python, or otherwise) that could validate all the constraints and call the scontrol program if appropriate, and then run that script via “sudo” as one of the regular users? Fr

Re: [slurm-users] Changing a user's default account

2022-08-05 Thread Renfro, Michael
This should work: sacctmgr add user someuser account=newaccount # adds user to new account sacctmgr modify user where user=someuser set defaultaccount=newaccount # change default sacctmgr remove user where user=someuser and account=oldaccount # remove from old account From: slurm-users on be

Re: [slurm-users] Sharing a GPU

2022-04-03 Thread Renfro, Michael
Someone else may see another option, but NVIDIA MIG seems like the straightforward option. That would require both a Slurm upgrade and the purchase of MIG-capable cards. https://slurm.schedmd.com/gres.html#MIG_Management Would be able to host 7 users per A100 card, IIRC. On Apr 3, 2022, at 4:2

Re: [slurm-users] Performance with hybrid setup

2022-03-13 Thread Renfro, Michael
Slurm supports an l3_cache_as_socket [1] parameter in recent releases. That would make an Epyc system, for example, appear to have many more sockets than physically exist, and that should help ensure threads in a single task share a cache. You’d want to run slurmd -C on a node with that setting

Re: [slurm-users] Can job submit plugin detect "--exclusive" ?

2022-02-22 Thread Renfro, Michael
For later reference, [1] should be the (current) authoritative source on data types for the job_desc values: some strings, some numbers, some booleans. [1] https://github.com/SchedMD/slurm/blob/4c21239d420962246e1ac951eda90476283e7af0/src/plugins/job_submit/lua/job_submit_lua.c#L450 From: slurm

Re: [slurm-users] Fairshare within a single Account (Project)

2022-02-01 Thread Renfro, Michael
On 1/30/22 21:14, Renfro, Michael wrote: You can. We use: sacctmgr show assoc where account=researchgroup format=user,share to see current fairs

Re: [slurm-users] Fairshare within a single Account (Project)

2022-01-30 Thread Renfro, Michael
You can. We use: sacctmgr show assoc where account=researchgroup format=user,share to see current fairshare within the account, and: sacctmgr modify user where name=someuser account=researchgroup set fairshare=N to modify a particular user's fairshare within the account. From:

Re: [slurm-users] how to allocate high priority to low cpu and memory jobs

2022-01-25 Thread Renfro, Michael
Since there's only 9 factors to assign priority weights to, one way around this might be to set up separate partitions for high memory and low memory jobs (with a max memory allowed for the low memory partition), and then use partition weights to separate those jobs out. From: slurm-users on b

Re: [slurm-users] Questions about default_queue_depth

2022-01-12 Thread Renfro, Michael
Not answering every question below, but for (1) we're at 200 on a cluster with a few dozen nodes and around 1k cores, as per https://lists.schedmd.com/pipermail/slurm-users/2021-June/007463.html -- there may be other settings in that email that could be beneficial. We had a lot of idle resource

Re: [slurm-users] work with sensitive data

2021-12-17 Thread Renfro, Michael
Untested, but given a common service account with a GPG key pair, a user with a GPG key pair, and the EncFS encrypted with a password, the user could encrypt a password with their own private key and the service account's public key, and leave it alongside the EncFS. If the service account is m

Re: [slurm-users] Reserving cores without immediately launching tasks on all of them

2021-11-26 Thread Renfro, Michael
billing=24 AllocTRES= CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s From: slurm-users On Behalf Of Renfro, Michael Sent: Friday, November 26, 2021 8:15 AM To: Slurm User Community List Subject: [EXTERNAL] Re: [slurm-users] Reserv

Re: [slurm-users] Reserving cores without immediately launching tasks on all of them

2021-11-26 Thread Renfro, Michael
The end of the MPICH section at [1] shows an example using salloc [2]. Worst case, you should be able to use the output of “scontrol show hostnames” [3] and use that data to make mpiexec command parameters to run one rank per node, similar to what’s shown at the end of the synopsis section of [4

Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is not specified

2021-09-27 Thread Renfro, Michael
return slurm.ERROR end end end Fritz Ratnasamy Data Scientist Information Technology The University of Chicago Booth School of Business 5807 S. Woodlawn Chicago, Illinois 60637 Phone: +(1) 773-834-4556 On Mon, Sep 27, 2021 at 1:40 PM Renfro, Michael mailto:re

Re: [slurm-users] EXTERNAL-Re: Block jobs on GPU partition when GPU is not specified

2021-09-27 Thread Renfro, Michael
in in slurm.conf/ is there any Slurm service to restart after that? Thanks again Fritz Ratnasamy Data Scientist Information Technology The University of Chicago Booth School of Business 5807 S. Woodlawn Chicago, Illinois 60637 Phone: +(1) 773-834-4556 On Sat, Sep 25, 2021 at 11:08 AM Renfro, Mic

Re: [slurm-users] Block jobs on GPU partition when GPU is not specified

2021-09-25 Thread Renfro, Michael
If you haven't already seen it there's an example Lua script from SchedMD at [1], and I've got a copy of our local script at [2]. Otherwise, in the order you asked: 1. That seems reasonable, but our script just checks if there's a gres at all. I don't *think* any gres other than gres=gpu wo

Re: [slurm-users] Regarding job in pending state

2021-09-16 Thread Renfro, Michael
If you're not the cluster admin, you'll want to check with them, but that should be related to a limit in how many node-hours an association (a unique combination of user, cluster, partition, and account) can have in running or pending state. Further jobs would get blocked to allow others' jobs

Re: [slurm-users] estimate queue time using 'sbatch --test-only'

2021-09-15 Thread Renfro, Michael
I can imagine at least the following causing differences in the estimated time and the actual start time: * If running users have overestimated their job times, and their jobs finish earlier than expected, the original estimate will be high. * If another user's job submission gets highe

Re: [slurm-users] scancel gpu jobs when gpu is not requested

2021-08-26 Thread Renfro, Michael
Not a solution to your exact problem, but we document partitions for interactive, debug, and batch, and have a job_submit.lua [1] that routes GPU-reserving jobs to gpu-interactive, gpu-debug, and gpu partitions automatically. Since our GPU nodes have extra memory slots, and have tended to run a

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Renfro, Michael
Did Diego's suggestion from [1] not help narrow things down? [1] https://lists.schedmd.com/pipermail/slurm-users/2021-August/007708.html From: slurm-users on behalf of Jack Chen Date: Tuesday, August 10, 2021 at 10:08 AM To: Slurm User Community List Subject: Re: [slurm-users] Compact schedul

Re: [slurm-users] Slurm Scheduler Help

2021-06-11 Thread Renfro, Michael
Not sure it would work out to 60k queued jobs, but we're using: SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200 in our setup. bf_window is driven by our 30-day max job time, bf_resolution is at 5% of that time, and the other values ar

Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-09 Thread Renfro, Michael
From: slurm-users On Behalf Of Renfro, Michael Sent: Tuesday, 8 June 2021 20:12 To: Slurm User Community List Subject: Re: [slurm-users] Kill job when child process gets OOM-killed Any reason *not* to create an array of 100k jobs and let

Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-08 Thread Renfro, Michael
Any reason *not* to create an array of 100k jobs and let the scheduler just handle things? Current versions of Slurm support arrays of up to 4M jobs, and you can limit the number of jobs running simultaneously with the '%' specifier in your array= sbatch parameter. From: slurm-users on behalf
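The throttled array suggested there can be sketched with the '%' specifier; the array size and running limit below are illustrative:

```shell
#!/bin/bash
# Hypothetical sketch: a 100k-element array with at most 100
# elements running at once ("%100").
#SBATCH --array=0-99999%100
#SBATCH --ntasks=1

# Each element selects its work item from its zero-based index.
echo "processing item ${SLURM_ARRAY_TASK_ID}"
```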

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-05-14 Thread Renfro, Michael
Untested, but prior experience with cgroups indicates that if things are working correctly, even if your code tries to run as many processes as you have cores, those processes will be confined to the cores you reserve. Try a more compute-intensive worker function that will take some seconds or

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-12 Thread Renfro, Michael
lo.edu> you could inquire at. [1] https://github.com/ubccr/xdmod/releases/tag/v9.5.0-rc.4 From: Diego Zuccato Date: Wednesday, May 12, 2021 at 8:37 AM To: Renfro, Michael Cc: Slurm User Community List Subject: Re: [slurm-users] Cluster usage, filtered by partition Il 12/05/21 13:30, Die

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-12 Thread Renfro, Michael
://xdmod.ccr.buffalo.edu/ — may be the easiest way to explore it. On May 12, 2021, at 3:52 AM, Diego Zuccato wrote: Il 11/05/21 21:20, Renfro, Michael ha scritto: In a word, nothing that's guaranteed to be stable. I got my start from this reply on the XDMoD list in November 2019. Worked on 8.0: Tks for the

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Renfro, Michael
List Subject: Re: [slurm-users] Cluster usage, filtered by partition On Tue, May 11, 2021 at 5:55 AM Renfro, Michael wrote: > > XDMoD [1] is useful for this, but it’s not a simple script. It does have some > user-accessible APIs if you want some report automation. I’m using that to >

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Renfro, Michael
XDMoD [1] is useful for this, but it’s not a simple script. It does have some user-accessible APIs if you want some report automation. I’m using that to create a lightning-talk-style slide at [2]. [1] https://open.xdmod.org/ [2] https://github.com/mikerenfro/one-page-presentation-hpc On May 11,

Re: [slurm-users] Testing Lua job submit plugins

2021-05-06 Thread Renfro, Michael
I’ve used the structure at https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5 to handle basic test/production branching. I can isolate the new behavior down to just a specific set of UIDs that way. Factoring out code into separate functions helps, too. I’ve seen others go so f

Re: [slurm-users] [External] Slurm Configuration assistance: Unable to use srun after installation (slurm on fedora 33)

2021-04-19 Thread Renfro, Michael
You'll definitely need to get slurmd and slurmctld working before proceeding further. slurmctld is the Slurm controller mentioned when you do the srun. Though there's probably some other steps you can take to make the slurmd and slurmctld system services available, it might be simpler to do the

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-16 Thread Renfro, Michael
I can't speak to what happens on node failure, but I can at least get you a greatly simplified pair of scripts that will run only one copy on each node allocated: #!/bin/bash # notarray.sh #SBATCH --nodes=28 #SBATCH --ntasks-per-node=1 #SBATCH --no-kill echo "notarray.sh is running on $(hostnam

Re: [slurm-users] derived counters

2021-04-13 Thread Renfro, Michael
I'll never miss an opportunity to plug XDMoD for anyone who doesn't want to write custom analytics for every metric. I've managed to get a little bit into its API to extract current values for number of jobs completed and the number of CPU-hours provided, and insert those into a single slide pre

Re: [slurm-users] [External] Autoset job TimeLimit to fit in a reservation

2021-03-30 Thread Renfro, Michael
I'd probably write a shell function that would calculate the time required, and add it as a command-line parameter to sbatch. We do a similar thing for easier interactive shells in our /etc/profile.d folder on the login node: function hpcshell() { srun --partition=interactive $@ --pty bash -i

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Renfro, Michael
Just a starting guess, but are you certain the MATLAB script didn’t try to allocate enormous amounts of memory for variables? That’d be about 16e9 floating point values, if I did the units correctly. On Mar 15, 2021, at 12:53 PM, Chin,David wrote:

Re: [slurm-users] Managing Multiple Dependencies

2021-03-02 Thread Renfro, Michael
There may be prettier ways, but this gets the job done. Captures the output from each sbatch command to get a job ID, colon separates the ones in the second group, and removes the trailing colon before submitting the last job: #!/bin/bash JOB1=$(sbatch job1.sh | awk '{print $NF}') echo "Submitt
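The ID-capturing and colon-joining pattern from that message can be exercised with canned `sbatch` output; real submissions would replace the `echo` commands:

```shell
# sbatch prints "Submitted batch job NNN"; awk grabs the last
# field. The echo lines are stand-ins for real sbatch calls.
id1=$(echo "Submitted batch job 101" | awk '{print $NF}')
id2=$(echo "Submitted batch job 102" | awk '{print $NF}')
# Build the dependency spec for the final job.
deps="afterok:${id1}:${id2}"
echo "$deps"
# -> afterok:101:102
```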

Re: [slurm-users] using resources effectively?

2020-12-16 Thread Renfro, Michael
We have overlapping partitions for GPU work and some kinds non-GPU work (both large memory and regular memory jobs). For 28-core nodes with 2 GPUs, we have: PartitionName=gpu MaxCPUsPerNode=16 … Nodes=gpunode[001-004] PartitionName=any-interactive MaxCPUsPerNode=12 … Nodes=node[001-040],gpunode
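A hedged slurm.conf sketch of the overlapping-partition idea (partition and node names as quoted in the message; the second gpunode range is an assumption, matching the gpu partition's):

```
PartitionName=gpu             MaxCPUsPerNode=16 Nodes=gpunode[001-004]
PartitionName=any-interactive MaxCPUsPerNode=12 Nodes=node[001-040],gpunode[001-004]
```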

Re: [slurm-users] FairShare

2020-12-02 Thread Renfro, Michael
Yesterday, I posted https://docs.rc.fas.harvard.edu/kb/fairshare/

Re: [slurm-users] Doubts with Fairshare

2020-12-01 Thread Renfro, Michael
Harvard's Arts & Sciences Research Computing group has a good explanation of these columns at https://docs.rc.fas.harvard.edu/kb/fairshare/ -- might not answer your exact question, but it does go into how the FairShare column is calculated. From: slurm-users Date: Tuesday, December 1, 2020 at

Re: [slurm-users] sbatch overallocation

2020-10-10 Thread Renfro, Michael
I think the answer depends on why you’re trying to prevent the observed behavior: * Do you want to ensure that one job requesting 9 tasks (and 1 CPU per task) can’t overstep its reservation and take resources away from other jobs on those nodes? Cgroups [1] should be able to confine the jo
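For the cgroup confinement mentioned in the first bullet, a hedged configuration sketch using standard, documented Slurm options:

```
# slurm.conf: track and confine tasks with cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf: keep jobs within their allocated cores and memory
ConstrainCores=yes
ConstrainRAMSpace=yes
```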

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Renfro, Michael
From any node you can run scontrol from, what does ‘scontrol show node GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and “CfgTRES=”. From: slurm-users on behalf of Sajesh Singh Reply-To: Slurm User Community List Date: Thursday, October 8, 2020 at 3:33 PM To: Slurm U

Re: [slurm-users] Auto-select partition?

2020-10-02 Thread Renfro, Michael
Depends on the version of Slurm. The docs for 17.11 [1] shows using packjob, and the docs for the current version (20.02 as of this writing) [2] shows using hetjob. It's really easy to wind up on documentation later than your running version, since only the top-level documentation page [3] menti

Re: [slurm-users] Simple free for all cluster

2020-10-02 Thread Renfro, Michael
Depending on the users who will be on this cluster, I'd probably adjust the partition to have a defined, non-infinite MaxTime, and maybe a lower DefaultTime. Otherwise, it would be very easy for someone to start a job that reserves all cores until the nodes get rebooted, since all they have to d

Re: [slurm-users] Running gpu and cpu jobs on the same node

2020-09-30 Thread Renfro, Michael
I could have missed a detail on my description, but we definitely don’t enable oversubscribe, or shared, or exclusiveuser. All three of those are set to “no” on all active queues. Current subset of slurm.conf and squeue output: = # egrep '^PartitionName=(gpu|any-interactive) ' /etc/slurm/s

Re: [slurm-users] Running gpu and cpu jobs on the same node

2020-09-30 Thread Renfro, Michael
We share our 28-core gpu nodes with non-gpu jobs through a set of ‘any’ partitions. The ‘any’ partitions have a setting of MaxCPUsPerNode=12, and the gpu partitions have a setting of MaxCPUsPerNode=16. That’s more or less documented in the slurm.conf documentation under “MaxCPUsPerNode”. From: s

Re: [slurm-users] Limit a partition or host to jobs less than 4 cores?

2020-09-30 Thread Renfro, Michael
Untested, but a combination of a QOS with MaxTRESPerJob=cpu=X and a partition that allows or denies that QOS may work. A job_submit.lua should be able to adjust the QOS of a submitted job, too. On 9/30/20, 10:50 AM, "slurm-users on behalf of Paul Edmon" wrote:

Re: [slurm-users] Mocking SLURM to debug job_submit.lua

2020-09-23 Thread Renfro, Michael
Not having a separate test environment, I put logic into my job_submit.lua to use either the production settings or the ones under development or testing, based off the UID of the user submitting the job: = function slurm_job_submit(job_desc, part_list, submit_uid) test_user_table = {}

Re: [slurm-users] Question/Clarification: Batch array multiple tasks on nodes

2020-09-01 Thread Renfro, Michael
We set DefMemPerCPU in each partition to approximately the amount of RAM in a node divided by the number of cores in the node. For heterogeneous partitions, we use a lower limit, and we always reserve a bit of RAM for the OS, too. So for a 64 GB node with 28 cores, we default to 2000 M per CPU,
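The per-CPU default described there follows from simple arithmetic; the 2 GB OS reserve in this sketch is an assumed figure:

```shell
# 64 GB node, 28 cores: subtract an OS reserve, divide by cores,
# then round down to a tidy DefMemPerCPU (the message uses 2000 MB).
total_mb=65536
reserve_mb=2048
cores=28
per_cpu=$(( (total_mb - reserve_mb) / cores ))
echo "$per_cpu"
# -> 2267, rounded down in practice to DefMemPerCPU=2000
```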

Re: [slurm-users] Jobs getting StartTime 3 days in the future?

2020-08-31 Thread Renfro, Michael
One pending job in this partition should have a reason of “Resources”. That job has the highest priority, and if your job below would delay the highest-priority job’s start, it’ll get pushed back like you see here. On Aug 31, 2020, at 12:13 PM, Holtgrewe, Manuel wrote: Dear all, I'm seeing s

Re: [slurm-users] Adding Users to Slurm's Database

2020-08-18 Thread Renfro, Michael
The PowerShell script I use to provision new users adds them to an Active Directory group for HPC, ssh-es to the management node to do the sacctmgr changes, and emails the user. Never had it fail, and I've looped over entire class sections in PowerShell. Granted, there are some inherent delays d

Re: [slurm-users] scheduling issue

2020-08-14 Thread Renfro, Michael
We’ve run a similar setup since I moved to Slurm 3 years ago, with no issues. Could you share partition definitions from your slurm.conf? When you see a bunch of jobs pending, which ones have a reason of “Resources”? Those should be the next ones to run, and ones with a reason of “Priority” are

Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Renfro, Michael
I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or COREs= settings. Currently, they’re: NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15 and I’ve got 2 jobs currently running on each node that’s available. So maybe: NodeName=c0005
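A hedged gres.conf sketch of the suggested non-overlapping layout for a 4-GPU node (`c0005` is the node named in the message; the core ranges and device files are illustrative):

```
NodeName=c0005 Name=gpu File=/dev/nvidia0 COREs=0-3
NodeName=c0005 Name=gpu File=/dev/nvidia1 COREs=4-7
NodeName=c0005 Name=gpu File=/dev/nvidia2 COREs=8-11
NodeName=c0005 Name=gpu File=/dev/nvidia3 COREs=12-15
```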

Re: [slurm-users] Correct way to give srun and sbatch different MaxTime values?

2020-08-04 Thread Renfro, Michael
Untested, but you should be able to use a job_submit.lua file to detect if the job was started with srun or sbatch: * Check with (job_desc.script == nil or job_desc.script == '') * Adjust job_desc.time_limit accordingly Here, I just gave people a shell function "hpcshell", which automati

Re: [slurm-users] Internet connection loss with srun to a node

2020-08-02 Thread Renfro, Michael
Probably unrelated to slurm entirely, and most likely has to do with lower-level network diagnostics. I can guarantee that it’s possible to access Internet resources from a compute node. Notes and things to check: 1. Both ping and http/https are IP protocols, but are very different (ping isn’t

Re: [slurm-users] slurm array with non-numeric index values

2020-07-15 Thread Renfro, Michael
If the 500 parameters happened to be filenames, you could adapt something like this (appropriated from somewhere else, but I can’t find the reference quickly): = #!/bin/bash # get count of files in this directory NUMFILES=$(ls -1 *.inp | wc -l) # subtract 1 as we have to use zero-based indexing (first e
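The index-to-filename mapping that script builds toward can be tried with a canned list; in a real job the list would come from `ls -1 *.inp` and the index from `$SLURM_ARRAY_TASK_ID`:

```shell
# Map a zero-based array index to the Nth filename.
files="a.inp
b.inp
c.inp"
task_id=1                      # stands in for $SLURM_ARRAY_TASK_ID
# sed -n "Np" prints line N; add 1 for zero-based indexing.
file=$(printf '%s\n' "$files" | sed -n "$((task_id + 1))p")
echo "$file"
# -> b.inp
```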

Re: [slurm-users] CPU allocation for the GPU jobs.

2020-07-13 Thread Renfro, Michael
“The SchedulerType configuration parameter specifies the scheduler plugin to use. Options are sched/backfill, which performs backfill scheduling, and sched/builtin, which attempts to schedule jobs in a strict priority order within each partition/queue.” https://slurm.schedmd.com/sched_config.ht

Re: [slurm-users] runtime priority

2020-06-30 Thread Renfro, Michael
There’s a --nice flag to sbatch and srun, at least. Documentation indicates it decreases priority by 100 by default. And untested, but it may be possible to use a job_submit.lua [1] to adjust nice values automatically. At least I can see a nice property in [2], which I assume means it'd be acce

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-16 Thread Renfro, Michael
Not trying to argue unnecessarily, but what you describe is not a universal rule, regardless of QOS. Our GPU nodes are members of 3 GPU-related partitions, 2 more resource-limited non-GPU partitions, and one of two larger-memory partitions. It’s set up this way to minimize idle resources (due t

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-15 Thread Renfro, Michael
ds Navin On Sat, Jun 13, 2020, 20:37 Renfro, Michael mailto:ren...@tntech.edu>> wrote: Will probably need more information to find a solution. To start, do you have separate partitions for GPU and non-GPU jobs? Do you have nodes without GPUs? On Jun 13, 2020, at 12:28 AM, navin srivas

Re: [slurm-users] ignore gpu resources to scheduled the cpu based jobs

2020-06-13 Thread Renfro, Michael
Will probably need more information to find a solution. To start, do you have separate partitions for GPU and non-GPU jobs? Do you have nodes without GPUs? On Jun 13, 2020, at 12:28 AM, navin srivastava wrote: Hi All, In our environment we have GPU. so what i found is if the user having high

Re: [slurm-users] Fairshare per-partition?

2020-06-12 Thread Renfro, Michael
I think that’s correct. From notes I’ve got for how we want to handle our fairshare in the future: Setting up a funded account (which can be assigned a fairshare): sacctmgr add account member1 Description="Member1 Description" FairShare=N Adding/removing a user to/from the funded accoun

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
node with oversubscribe should be sufficient. > If you can't spare a single node then a VM would do the job. > > -Paul Edmon- > > On 6/11/2020 9:28 AM, Renfro, Michael wrote: >> That’s close to what we’re doing, but without dedicated nodes. We have three >> back-

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
That’s close to what we’re doing, but without dedicated nodes. We have three back-end partitions (interactive, any-interactive, and gpu-interactive), but the users typically don’t have to consider that, due to our job_submit.lua plugin. All three partitions have a default of 2 hours, 1 core, 2

Re: [slurm-users] Slurm Job Count Credit system

2020-06-01 Thread Renfro, Michael
Even without the slurm-bank system, you can enforce a limit on resources with a QOS applied to those users. Something like: sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit sacctmgr modify qos bank1 set grptresmins=cpu=1000 sacctmgr add account bank1 sacctmgr modify account name=bank1 set
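The command sequence in the preview is cut off; a hedged reconstruction of the same idea follows (untested — it assumes accounting is enabled via slurmdbd, the final two lines are my guesses at the truncated steps, and `bank1`/`someuser` are placeholders):

```shell
# Create a "bank" QOS whose usage never decays and which rejects jobs
# that would exceed the group limit:
sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit
# Grant the bank 1000 CPU-minutes of total usage:
sacctmgr modify qos bank1 set grptresmins=cpu=1000
# Tie the QOS to an account and attach users to it:
sacctmgr add account bank1
sacctmgr modify account name=bank1 set qos=bank1
sacctmgr add user someuser account=bank1
```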

Re: [slurm-users] Ubuntu Cluster with Slurm

2020-05-13 Thread Renfro, Michael
I'd compare the RealMemory part of 'scontrol show node abhi-HP-EliteBook-840-G2' to the RealMemory part of your slurm.conf: > Nodes which register to the system with less than the configured resources > (e.g. too little memory), will be placed in the "DOWN" state to avoid > scheduling jobs on t
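A quick way to make that comparison (a sketch; the node name comes from the thread and the slurm.conf path may differ on your system):

```shell
# Memory (MB) the node actually registered with slurmctld:
scontrol show node abhi-HP-EliteBook-840-G2 | grep -o 'RealMemory=[0-9]*'

# Memory slurm.conf claims the node has:
grep -i 'RealMemory' /etc/slurm/slurm.conf

# If slurm.conf claims more than the node registers, the node goes DOWN;
# lower the configured value, then apply it:
#   scontrol reconfigure
```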

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-09 Thread Renfro, Michael
restart. Thanks. > On May 8, 2020, at 11:47 AM, Renfro, Michael wrote: > > Working on something like that now. From an SQL export, I see 16 jobs from > my user that have a state of 7. Both states 3 and 7 show up as COMPLETED in > sacct, and may also have some duplicate job en

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
f,to,pr" > # Get Slurm individual job accounting records using the "sacct" command > sacct $partitionselect -n -X -a -S $start_time -E $end_time -o $FORMAT > -s $STATE > > There are numerous output fields which you can inquire, see "sacct -e". > > /Ole >

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
s that already completed, but still get counted against the user's current requests. From: Ole Holm Nielsen Sent: Friday, May 8, 2020 9:27 AM To: slurm-users@lists.schedmd.com Cc: Renfro, Michael Subject: Re: [slurm-users] scontrol show assoc_mgr showing m

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
e user's limits are printed in detail by showuserlimits. These tools are available from https://github.com/OleHolmNielsen/Slurm_tools /Ole On 08-05-2020 15:34, Renfro, Michael wrote: > Hey, folks. I've had a 1000 CPU-day (1,440,000 CPU-minutes) GrpTRESMins > limit applied to each

[slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Hey, folks. I've had a 1000 CPU-day (1,440,000 CPU-minutes) GrpTRESMins limit applied to each user for years. It generally works as intended, but I have one user I've noticed whose usage is highly inflated from reality, causing the GrpTRESMins limit to be enforced much earlier than necessary: squ
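For reference, the unit conversion behind that limit (GrpTRESMins counts CPU-minutes, so a CPU-day budget must be multiplied out) — a small sketch, not from the original thread:

```python
def cpu_days_to_grptresmins(cpu_days: int) -> int:
    """Convert a CPU-day budget to the CPU-minute units GrpTRESMins uses."""
    return cpu_days * 24 * 60

# A 1000 CPU-day budget is 1,440,000 CPU-minutes:
print(cpu_days_to_grptresmins(1000))  # -> 1440000
```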

Re: [slurm-users] Defining a default --nodes=1

2020-05-08 Thread Renfro, Michael
There are MinNodes and MaxNodes settings that can be defined for each partition listed in slurm.conf [1]. Set both to 1 and you should end up with the non-MPI partitions you want. [1] https://slurm.schedmd.com/slurm.conf.html From: slurm-users on behalf of Ho
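As a slurm.conf sketch (node and partition names are placeholders): pinning MinNodes and MaxNodes to 1 keeps jobs in the non-MPI partition on a single node, while the MPI partition is left unconstrained.

```
# Non-MPI work: every job confined to one node
PartitionName=serial Nodes=node[01-10] MinNodes=1 MaxNodes=1 Default=YES
# MPI work: multi-node allocations allowed
PartitionName=mpi    Nodes=node[01-10]
```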

Re: [slurm-users] how to restrict jobs

2020-05-06 Thread Renfro, Michael
in this case] > > Regards > Navin. > > > On Wed, May 6, 2020 at 7:47 PM Renfro, Michael wrote: > To make sure I’m reading this correctly, you have a software license that > lets you run jobs on up to 4 nodes at once, regardless of how many CPUs you > use? That is, y

Re: [slurm-users] how to restrict jobs

2020-05-06 Thread Renfro, Michael
specific > nodes? > i do not want to create a separate partition. > > is there any way to achieve this by any other method? > > Regards > Navin. > > > Regards > Navin. > > On Tue, May 5, 2020 at 7:46 PM Renfro, Michael wrote: > Haven’t done it yet

Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Renfro, Michael
Aside from any Slurm configuration, I’d recommend setting up a modules [1 or 2] folder structure for CUDA and other third-party software. That handles LD_LIBRARY_PATH and other similar variables, reduces the chances for library conflicts, and lets users decide their environment on a per-job basi
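A minimal Tcl modulefile sketch for the CUDA case mentioned above (the install path and version are placeholders; installed as e.g. a file named for the version under your modulefiles tree):

```
#%Module1.0
# Hypothetical modulefile for CUDA; adjust paths to your install.
prepend-path PATH            /opt/cuda-11.8/bin
prepend-path LD_LIBRARY_PATH /opt/cuda-11.8/lib64
setenv       CUDA_HOME       /opt/cuda-11.8
```

Users then opt in per job with `module load cuda/11.8`, avoiding global LD_LIBRARY_PATH conflicts.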

Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
ically updated the value based on usage? > > > Regards > Navin. > > > On Tue, May 5, 2020 at 7:00 PM Renfro, Michael wrote: > Have you seen https://slurm.schedmd.com/licenses.html already? If the > software is just for use inside the cluster, one Licenses= line in s

Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
Have you seen https://slurm.schedmd.com/licenses.html already? If the software is just for use inside the cluster, one Licenses= line in slurm.conf plus users submitting with the -L flag should suffice. Should be able to set that license value to 4 if it's licensed per node and you can run up to
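A sketch of that setup ("appsw" is a placeholder license name; the count of 4 matches the per-node licensing described above):

```
# slurm.conf: declare 4 cluster-local licenses
Licenses=appsw:4

# Jobs then request one license per node they use, e.g.:
#   sbatch -L appsw:1 --nodes=1 job.sh
```

With DenyOnLimit-style behavior built in, a fifth concurrent job requesting the license simply pends until one is released.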

Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-04 Thread Renfro, Michael
Assuming you need a scheduler for whatever size your user population is: do they need literal JupyterHub, or would they all be satisfied running regular Jupyter notebooks? On May 4, 2020, at 7:25 PM, Lisa Kay Weihl wrote: External Email Warning This email originated from outside the univer

Re: [slurm-users] one job at a time - how to set?

2020-04-30 Thread Renfro, Michael
'd have to specify this when submitting, right? I.e. 'sbatch > --exclusive myjob.sh', if I understand correctly. Would there be a way to > simply enforce this, i.e. at the slurm.conf level or something? > > Thanks again! > > Rutger > > On Wed, Apr 29, 2020 at

Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Renfro, Michael
That’s a *really* old version, but https://slurm.schedmd.com/archive/slurm-15.08.13/sbatch.html indicates there’s an exclusive flag you can set. On Apr 29, 2020, at 1:54 PM, Rutger Vos wrote: Hi, for a smallish machine that has been having degraded performance we want to implement a pol
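Enforcing whole-node allocation without per-job flags can also be done at the partition level in slurm.conf — a sketch (node and partition names are placeholders):

```
# Force every job in the partition to get the node exclusively, so only
# one job runs at a time on a single-node machine. Older releases (such
# as the 15.08 series in this thread) spell this Shared=EXCLUSIVE rather
# than OverSubscribe=EXCLUSIVE.
PartitionName=batch Nodes=node01 OverSubscribe=EXCLUSIVE Default=YES
```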
