[slurm-users] Enforcing -c and -t for fairshare scheduling and other setting

2022-05-13 Thread r
to prevent large jobs from being submitted and dispatched ahead of smaller jobs, and to further reward conserving resources. Many thanks, -R

Re: [slurm-users] slurm-users Digest, Vol 67, Issue 20

2023-05-17 Thread Sridhar R
Can you please remove my email id from your mailing list? I don't want these emails anymore. Thanks. On Wed, May 17, 2023 at 11:42 PM wrote: > Send slurm-users mailing list submissions to > slurm-users@lists.schedmd.com > > To subscribe or unsubscribe via the World Wide Web, visit >

Re: [slurm-users] x11 forwarding not available?

2018-10-15 Thread R. Paul Wiegand
I believe you also need: X11UseLocalhost no > On Oct 15, 2018, at 7:07 PM, Dave Botsch wrote: > > Hi. > > X11 forwarding is enabled and works for normal ssh. > > Thanks. > > On Mon, Oct 15, 2018 at 09:55:59PM +, Rhian Resnick wrote: >> >> >> Double check /etc/ssh/sshd_config allows X
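For reference, a minimal sketch of the sshd_config settings mentioned in this thread (standard OpenSSH option names; restarting sshd assumes a systemd host):

    # /etc/ssh/sshd_config (excerpt)
    X11Forwarding yes
    X11UseLocalhost no
    # then: systemctl restart sshd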

Re: [slurm-users] Reserving a GPU

2018-10-22 Thread R. Paul Wiegand
I had the same question and put in a support ticket. I believe the answer is that you cannot. On Mon, Oct 22, 2018, 11:51 Christopher Benjamin Coffey < chris.cof...@nau.edu> wrote: > Hi, > > I can't figure out how one would create a reservation to reserve a gres > unit, such as a gpu. The man pa

Re: [slurm-users] how to find out why a job won't run?

2018-11-26 Thread R. Paul Wiegand
Steve, This doesn't really address your question, and I am guessing you are aware of this; however, since you did not mention it: "scontrol show job " will give you a lot of detail about a job (a lot more than squeue). Its "Reason" is the same as sinfo and squeue, though. So no help there. I'v
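A short illustration of the commands above, plus sprio for the per-factor priority breakdown (the job ID 12345 is hypothetical):

    scontrol show job 12345      # full job record, including the Reason field
    squeue -j 12345 --start      # scheduler's estimated start time for a pending job
    sprio -j 12345               # priority factors contributing to the job's position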

[slurm-users] Having a possible cgroup issue?

2018-12-06 Thread Anderson, Wes R
parameters # WARNING: This file is managed by Puppet; changes are likely to be overwritten # CgroupAut

[slurm-users] Simple question but I can't find the answer

2019-01-10 Thread Jeffrey R. Lang
Guys When I run sinfo some of the nodes in the list show their hostname with a following asterisk. I've looked through the man pages and what I can find on the web, but nothing provides an answer. So what does the asterisk after the hostname mean? Jeff

Re: [slurm-users] Simple question but I can't find the answer

2019-01-10 Thread Jeffrey R. Lang
Is it following a host name, or a partition name? If the latter, it just means that it's the default partition. From: Jeffrey R. Lang <mailto:jrl...@u
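Illustrative sinfo output (node and partition names are made up): an asterisk after a partition name marks the default partition, while an asterisk after a node state (e.g. idle*) means the node is not responding:

    $ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    teton*       up 7-00:00:00     10   idle t[001-010]
    moran        up 7-00:00:00      2  idle* m[001-002]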

[slurm-users] Why is this command not working

2019-01-16 Thread Jeffrey R. Lang
I'm trying to set a maxjobs limit on a specific user in my cluster, but following the example in the sacctmgr man page I keep getting this error. sacctmgr -v modify user where name=jrlang cluster=teton account=microbiome set maxjobs=30 sacctmgr: Accounting storage SLURMDBD plugin loaded with Au
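For comparison, the general form usually used for this kind of per-association limit (user, account, and values are the ones from the post; treat the exact spelling as a sketch to check against your sacctmgr version):

    sacctmgr modify user where name=jrlang cluster=teton account=microbiome set MaxJobs=30
    # confirm the association picked up the limit
    sacctmgr show assoc where user=jrlang format=Cluster,Account,User,MaxJobs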

Re: [slurm-users] Nodes remaining in drain state once job completes

2019-03-18 Thread Pawel R. Dziekonski
On 18/03/2019 23.07, Eric Rosenberg wrote: > [2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task > failed This usually happens for me when one of the shared filesystems is overloaded and processes are stuck in uninterruptible sleep (D), thus unable to terminate. Your reason

Re: [slurm-users] Increasing job priority based on resources requested.

2019-04-21 Thread Pawel R. Dziekonski
Hi, you can always come up with some kind of submit "filter" that would assign constraints to jobs based on requested memory. In this way you can force smaller memory jobs to go only to low memory nodes and keep large memory nodes free from trash jobs. The disadvantage is that large mem nodes woul

[slurm-users] scontrol for a heterogenous job appears incorrect

2019-04-23 Thread Jeffrey R. Lang
I'm testing using heterogenous jobs for a user on our cluster, but I am seeing what I think is incorrect output from "scontrol show job XXX" for the job. The cluster is currently using Slurm 18.08. So my job script looks like this: #!/bin/sh ### This is a general SLURM script. You'll need to make modificat

Re: [slurm-users] scontrol for a heterogenous job appears incorrect

2019-04-24 Thread Jeffrey R. Lang
I was expecting. Jeff [jrlang@tlog1 TEST_CODE]$ sbatch check_nodes.sbatch Submitted batch job 2611773 [jrlang@tlog1 TEST_CODE]$ squeue | grep jrlang 2611773+1 teton CHECK_NO jrlang R 0:10 9 t[439-447] 2611773+0 teton-hug CHECK_NO jrlang R 0:10

Re: [slurm-users] Slurm database failure messages

2019-05-07 Thread Pawel R. Dziekonski
On 07/05/2019 13.47, David Baker wrote: > We are experiencing quite a number of database failures. > [root@blue51 slurm]# less slurmdbd.log-20190506.gz | grep failed > [2019-05-05T04:00:05.603] error: mysql_query failed: 1213 Deadlock found when > trying to get lock; try restarting transaction

[slurm-users] question about partition definition

2019-12-09 Thread Jeffrey R. Lang
I need to set up a partition that limits the number of jobs allowed to run at one time. Looking at the slurm.conf page for partition definitions I don't see a MaxJobs option. Is there a way to limit the number of jobs in a partition? Thanks, Jeff
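There is no MaxJobs parameter on a partition itself; a common workaround is to attach a QOS carrying a group job limit to the partition. A sketch under that assumption (QOS name, limit, and node list are invented):

    # create a QOS that caps the number of concurrently running jobs
    sacctmgr add qos partcap
    sacctmgr modify qos partcap set GrpJobs=50
    # slurm.conf: bind the QOS to the partition
    PartitionName=limited Nodes=node[01-10] QOS=partcap State=UP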

[slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-02-20 Thread Nathan R Crawford
Hi All, I have 19.05.4 and want to change SelectType from select/cons_res to select/cons_tres without losing running or pending jobs. The documentation is a bit conflicting. From the man page: SelectType Identifies the type of resource selection algorithm to be used. Changing this value can
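The change itself is a one-line edit to slurm.conf followed by daemon restarts; whether job state survives the switch is exactly the question in this thread, so treat the following as a sketch and test it outside production first:

    # slurm.conf
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory   # keep whatever parameters were already in use with cons_res
    # a plain "scontrol reconfigure" is not enough; restart the daemons
    systemctl restart slurmctld
    pdsh -w node[01-99] systemctl restart slurmd   # pdsh and the node list are illustrative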

Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-02-21 Thread Nathan R Crawford
Hi Chris, If it just requires restarting slurmctld and the slurmd processes on the nodes, I will be happy! Can you confirm that no running or pending jobs were lost in the transition? Thanks, Nate On Thu, Feb 20, 2020 at 6:54 PM Chris Samuel wrote: > On 20/2/20 2:16 pm, Nathan R Crawf

[slurm-users] Question about determining pre-empted jobs

2020-02-28 Thread Jeffrey R. Lang
I need your help. We have had a request to generate a report showing the number of jobs by date showing pre-empted jobs. We used sacct to try to gather the data but we only found a few jobs with the state "PREEMPTED". Scanning the slurmd logs we find there are a lot of jobs that show pre-empte

Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-03-29 Thread Nathan R Crawford
> > > > Hi Nate, > > > > On Fri, 2020-02-21 at 11:38 -0800, Nathan R Crawford wrote: > > > If it just requires restarting slurmctld and the slurmd processes > > > on the nodes, I will be happy! Can you confirm that no running or > > > pending jobs

[slurm-users] SLURM 20.11.0 no x11 forwarding.

2021-04-22 Thread Luis R. Torres
1Forwarding yes X11DisplayOffset 10 X11UseLocalhost no Our cluster is configured with SlurmUser=slurm, not root. Thanks, -- ---- Luis R. Torres

Re: [slurm-users] SLURM 20.11.0 no x11 forwarding.

2021-04-23 Thread Luis R. Torres
I believe that was the case, we compiled it with x11 support, however, further debugging suggests that there's an issue writing to the .Xauthority file when using forwarding through srun.

[slurm-users] Exposing only requested CPUs to a job on a given node.

2021-05-14 Thread Luis R. Torres
CPUs".format(multiprocessing.c pu_count())) for i in range(multiprocessing.cpu_count()): p = multiprocessing.Process(target=worker, name=i).star t() Thanks, -- ---- Luis R. Torres

Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-07-01 Thread Luis R. Torres
ist: {0, 1, 2, 10, 11, 12} = On Fri, May 14, 2021 at 1:35 PM Luis R. Torres wrote: > Hi Folks, > > We are currently running on SLURM 20.11.6 with cgroups constraints for > memory and CPU/Core. Can the scheduler only expose the requested number o
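Constraining a job to only its allocated cores is normally handled by the cgroup task plugin; a minimal sketch of the relevant settings, assuming the rest of slurm.conf and cgroup.conf is already in place:

    # slurm.conf
    TaskPlugin=task/cgroup,task/affinity
    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes

Note that tools which read the node's total CPU count (such as multiprocessing.cpu_count() in the script above) will still report every CPU on the node; len(os.sched_getaffinity(0)) reflects the cores actually allocated to the job.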

Re: [slurm-users] How to avoid a feature?

2021-07-02 Thread Jeffrey R. Lang
How about using node weights. Weight the non-gpu nodes so that they are scheduled first. The GPU nodes could have a very high weight so that the scheduler would consider them last for allocation. This would allow the non-gpu nodes to be filled first and when full schedule the GPU nodes. Us
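A slurm.conf sketch of the weighting idea (node names, counts, and sizes are made up; nodes with lower Weight are allocated first):

    NodeName=cpu[001-020] CPUs=32 RealMemory=128000 Weight=1
    NodeName=gpu[001-004] CPUs=32 RealMemory=256000 Gres=gpu:4 Weight=1000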

[slurm-users] Assigning two "cores" when I'm only request one.

2021-07-12 Thread Luis R. Torres
s/Cores: 2 Affinity List: {0, 10} = -- ---- Luis R. Torres

[slurm-users] big increase of MaxStepCount?

2022-01-12 Thread John R Anderson
s on this? can this successfully be applied to a partition or individual nodes only? i wonder about log files exploding or worse... thanks! [University of Nevada, Reno]<http://www.unr.edu/> John R. Anderson High-Performance Computing Engineer Office of Information Technology University

Re: [slurm-users] Fwd: useradd: group 'slurm' does not exist

2022-01-25 Thread Jeffrey R. Lang
Looking at what you provided in your email the groupadd commands are failing, due to the requested GID 991 and 992 already being assigned by the system you're installing on. Check the /etc/group file and find two GID numbers lower than 991 that are unused and use those instead. Keep them in the
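In practice that amounts to something like the following, assuming the two accounts in question are the usual munge and slurm system accounts (the GIDs/UIDs 981 and 982 are only examples; pick any free values and keep them identical on every node):

    getent group | awk -F: '$3 > 900 && $3 < 991 {print $3}' | sort -n   # see which GIDs are already taken
    groupadd -g 981 munge
    groupadd -g 982 slurm
    useradd -r -u 981 -g munge munge
    useradd -r -u 982 -g slurm slurm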

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-27 Thread Jeffrey R. Lang
The missing file error has nothing to do with slurm. The systemctl command is part of the system's service management. The error message indicates that you haven't copied the slurmd.service file to /etc/systemd/system or /usr/lib/systemd/system on your compute node. /etc/systemd/system is usua
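A minimal sketch of the fix being described (the source path is a placeholder for wherever your build or package left slurmd.service):

    # on the compute node
    cp /path/to/slurm-build/etc/slurmd.service /etc/systemd/system/
    systemctl daemon-reload
    systemctl enable --now slurmd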

[slurm-users] Where is the documentation for saving batch script

2022-03-17 Thread Jeffrey R. Lang
Hello I want to look into the new feature of saving job scripts in the Slurm database but have been unable to find documentation on how to do it. Can someone please point me in the right direction for the documentation or slurm configuration changes that need to be implemented? Thanks jeff
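For anyone searching later: job script storage is controlled by AccountingStoreFlags in slurm.conf (Slurm 21.08 and newer) and the stored scripts are retrieved through sacct; a sketch, assuming slurmdbd is already in use:

    # slurm.conf
    AccountingStoreFlags=job_script,job_env
    # after reconfiguring, retrieve what was stored for a job (job ID is hypothetical)
    sacct -j 12345 --batch-script
    sacct -j 12345 --env-vars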

[slurm-users] Help with failing job execution

2022-03-24 Thread Jeffrey R. Lang
My site recently updated to Slurm 21.08.6 and for the most part everything went fine. Two Ubuntu nodes however are having issues.Slurmd cannot execve the jobs on the nodes. As an example: [jrlang@tmgt1 ~]$ salloc -A ARCC --nodes=1 --ntasks=20 -t 1:00:00 --bell --nodelist=mdgx01 --partitio

[slurm-users] How to open a slurm support case

2022-03-24 Thread Jeffrey R. Lang
Can someone provide me with instructions on how to open a support case with SchedMD? We have a support contract, but nowhere on their website can I find a link to open a case with them. Thanks, Jeff

[slurm-users] Preempt jobs to stay within account TRES limits?

2022-10-21 Thread Matthew R. Baney
Hello, I have noticed that jobs submitted to non-preemptable partitions (PreemptType = preempt/partition_prio and PreemptMode = REQUEUE) under accounts with GrpTRES limits will become pending with AssocGrpGRES as the reason when the account is up against the relevant limit, even when there are oth

Re: [slurm-users] Per-user TRES summary?

2022-11-28 Thread Jeffrey R. Lang
TRESPU -- -- normal cpu=80,mem=320G But to work out what a user is currently using in currently running jobs, the nearest I can work out is: % sacct -X -s R --units=G -o User,ReqTRES%50 UserR
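Another place to look for live per-user TRES consumption is the association manager state inside slurmctld (output layout varies by version; the user name is illustrative):

    scontrol show assoc_mgr flags=qos        # current usage counted against QOS limits such as MaxTRESPU
    scontrol show assoc_mgr users=someuser   # usage for a single user's associations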

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jeffrey R. Lang
The service is available in RHEL 8 via the EPEL package repository as systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8 from epel. -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: Monday, October 30, 2023 1:56 PM T
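For the original question (holding slurmd back until the fabric is up), one common approach is a systemd drop-in that orders slurmd after network-online or after a site-specific health-check unit; a sketch:

    # /etc/systemd/system/slurmd.service.d/wait-for-network.conf
    [Unit]
    Wants=network-online.target
    After=network-online.target
    # then: systemctl daemon-reload && systemctl restart slurmd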

[slurm-users] Cleanup of old clusters in database

2024-01-10 Thread Jeffrey R. Lang
We have shuttered two clusters and need to remove them from the database. To do this, do we remove the table spaces associated with the cluster names from the Slurm database? Thanks, Jeff
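Dropping the table spaces by hand should not be necessary; sacctmgr has a supported way to remove a cluster and its associations from the database (the cluster name is illustrative):

    sacctmgr show cluster              # confirm the exact names first
    sacctmgr delete cluster oldcluster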

Re: [slurm-users] PMIx and Slurm

2017-11-28 Thread r...@open-mpi.org
Very true - one of the risks with installing from packages. However, be aware that slurm 17.02 doesn’t support PMIx v2.0, and so this combination isn’t going to work anyway. If you want PMIx v2.x, then you need to pair it with SLURM 17.11. Ralph > On Nov 28, 2017, at 2:32 PM, Philip Kovacs wr

Re: [slurm-users] PMIx and Slurm

2017-11-28 Thread r...@open-mpi.org
My apologies - I guess we hadn’t been tracking it that way. I’ll try to add some clarification. We presented a nice table at the BoF and I just need to find a few minutes to post it. I believe you do have to build slurm against PMIx so that the pmix plugin is compiled. You then also have to spe
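The build/runtime pairing described here looks roughly like the following (install prefixes and the application name are examples):

    ./configure --prefix=/opt/slurm --with-pmix=/opt/pmix/2.1
    make && make install
    # at run time, pick the plugin explicitly (or set MpiDefault=pmix in slurm.conf)
    srun --mpi=list
    srun --mpi=pmix_v2 ./my_mpi_app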

Re: [slurm-users] PMIx and Slurm

2017-11-28 Thread r...@open-mpi.org
mix.so library. If you favor using the pmix versions of > pmi/pmi2, sounds like you'll get better performance > when using pmi/pmi2, but as mentioned, you would want to test every mpi > variant listed to make sure everything works. > > > On Tuesday, November 28, 2017 9:

Re: [slurm-users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-18 Thread r...@open-mpi.org
Repeated here from the OMPI list: We have had reports of applications running faster when executing under OMPI’s mpiexec versus when started by srun. Reasons aren’t entirely clear, but are likely related to differences in mapping/binding options (OMPI provides a very large range compared to sru

Re: [slurm-users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-18 Thread r...@open-mpi.org
. > On Dec 18, 2017, at 5:23 PM, Christopher Samuel wrote: > > On 19/12/17 12:13, r...@open-mpi.org wrote: > >> We have had reports of applications running faster when executing under >> OMPI’s mpiexec versus when started by srun. > > Interesting, I know that

Re: [slurm-users] [17.11.1] no good pmi intention goes unpunished

2017-12-20 Thread r...@open-mpi.org
On Dec 20, 2017, at 6:21 PM, Philip Kovacs wrote: > > > -- slurm.spec: move libpmi to a separate package to solve a conflict with > > the > >version provided by PMIx. This will require a separate change to PMIx as > >well. > > I see the intention behind this change since the pmix 2.0+

Re: [slurm-users] [17.11.1] no good pmi intention goes unpunished

2017-12-21 Thread r...@open-mpi.org
2 code since it is compiled > directly into the plugin. > > > On Wednesday, December 20, 2017 10:47 PM, "r...@open-mpi.org" > wrote: > > > On Dec 20, 2017, at 6:21 PM, Philip Kovacs <mailto:pkde...@yahoo.com>> wrote: >> >> > --

Re: [slurm-users] [17.11.1] no good pmi intention goes unpunished

2017-12-21 Thread r...@open-mpi.org
s wrote: > > >(they are nothing more than symlinks to libpmix) > > This is very helpful to know. > > > On Thursday, December 21, 2017 3:28 PM, "r...@open-mpi.org" > wrote: > > > Hmmm - I think there may be something a little more subtle here. If you bui

[slurm-users] Using PMIx with SLURM

2018-01-03 Thread r...@open-mpi.org
Hi folks There have been some recent questions on both this and the OpenMPI mailing lists about PMIx use with SLURM. I have tried to capture the various conversations in a “how-to” guide on the PMIx web site: https://pmix.org/support/how-to/slurm-support/

[slurm-users] Fabric manager interactions: request for comments

2018-02-05 Thread r...@open-mpi.org
I apologize in advance if you received a copy of this from other mailing lists -- Hello all The PMIx community is starting work on the next phase of defining support for network interactions, looking specifically at things we might want to obtain and/or control v

Re: [slurm-users] Allocate more memory

2018-02-07 Thread r...@open-mpi.org
I’m afraid neither of those versions is going to solve the problem here - there is no way to allocate memory across nodes. Simple reason: there is no way for a process to directly address memory on a separate node - you’d have to implement that via MPI or shmem or some other library. > On Feb

Re: [slurm-users] Allocate more memory

2018-02-07 Thread r...@open-mpi.org
st 3G total memory ? even though my nodes were setup with 2G each > ?? > > #SBATCH array 1-10%10:1 > > #SBATCH mem-per-cpu=3000m > > srun R CMD BATCH myscript.R > > > > thanks > > > > > On 07/02/2018 15:50, Loris Bennett wrote: >> H

[slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Greetings, I am setting up our new GPU cluster, and I seem to have a problem configuring things so that the devices are properly walled off via cgroups. Our nodes each have two GPUs; however, if --gres is unset, or set to --gres=gpu:0, I can access both GPUs from inside a job. Moreover, if I ask fo
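The settings that usually matter for walling GPUs off per job are device constraint in cgroup.conf plus explicit device files in gres.conf; a sketch (node names, counts, and sizes are examples):

    # cgroup.conf
    ConstrainDevices=yes
    # gres.conf
    NodeName=gpu[01-04] Name=gpu File=/dev/nvidia[0-1]
    # slurm.conf
    GresTypes=gpu
    NodeName=gpu[01-04] CPUs=32 RealMemory=256000 Gres=gpu:2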

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
yours > ... > /dev/nvidia* > > There was a SLURM bug issue that made this clear, not so much in the > website docs. > > -Kevin > > > On 5/1/18, 5:28 PM, "slurm-users on behalf of R. Paul Wiegand" < > slurm-users-boun...@lists.schedmd.com on behalf of rpwi

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Thanks Chris. I do have the ConstrainDevices turned on. Are the differences in your cgroup_allowed_devices_file.conf relevant in this case? On Tue, May 1, 2018, 19:23 Christopher Samuel wrote: > On 02/05/18 09:00, Kevin Manalo wrote: > > > Also, I recall appending this to the bottom of > > > >

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Slurm 17.11.0 on CentOS 7.1 On Tue, May 1, 2018, 19:26 Christopher Samuel wrote: > On 02/05/18 09:23, R. Paul Wiegand wrote: > > > I thought including the /dev/nvidia* would whitelist those devices > > ... which seems to be the opposite of what I want, no? Or do I > >

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
pgrade. Should I just wait and test after the upgrade? On Tue, May 1, 2018, 19:56 Christopher Samuel wrote: > On 02/05/18 09:31, R. Paul Wiegand wrote: > > > Slurm 17.11.0 on CentOS 7.1 > > That's quite old (on both fronts, RHEL 7.1 is from 2015), we started on > that same

Re: [slurm-users] GPU / cgroup challenges

2018-05-02 Thread R. Paul Wiegand
manager. On Tue, May 1, 2018 at 8:29 PM, Christopher Samuel wrote: > On 02/05/18 10:15, R. Paul Wiegand wrote: > >> Yes, I am sure they are all the same. Typically, I just scontrol >> reconfig; however, I have also tried restarting all daemons. > > > Understood. Any

Re: [slurm-users] GPU / cgroup challenges

2018-05-21 Thread R. Paul Wiegand
uel wrote: > > On Wednesday, 2 May 2018 11:04:34 PM AEST R. Paul Wiegand wrote: > >> When I set "--gres=gpu:1", the slurmd log does have encouraging lines such >> as: >> >> [2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device >> /dev

[slurm-users] Noob slurm question

2018-12-12 Thread Merritt, Todd R - (tmerritt)
Hi all, I'm new to slurm. I've used PBS extensively and have set up an accounting system that gives groups/account a fixed number of hours per month on a per queue/partition basis. It decrements that time allocation with every job run and then resets it to the original value at t
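Slurm has no direct per-month bank out of the box; the usual approximation is a GrpTRESMins limit on each account combined with a monthly usage reset, which behaves like a time allocation that refills every month. A sketch of that approach (the account name and numbers are invented, and enforcement also requires AccountingStorageEnforce to include limits):

    # slurm.conf
    PriorityDecayHalfLife=0
    PriorityUsageResetPeriod=MONTHLY
    # give the account 100,000 CPU-minutes per reset period
    sacctmgr modify account research_grp set GrpTRESMins=cpu=100000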

Re: [slurm-users] Noob slurm question

2018-12-12 Thread Merritt, Todd R - (tmerritt)
n.org/pod/Slurm::Sshare>) which include a basic sbalance command script. You would likely need to modify the script for your situation (it assumes a situation more like the first example above), but that should not be too bad. On Wed, Dec 12, 2018 at 1:58 PM Merritt, Todd R - (tmerritt)

[slurm-users] Enforcing relative resource restrictions in submission script

2024-02-27 Thread Matthew R. Baney via slurm-users
Hello Slurm users, I'm trying to write a check in our job_submit.lua script that enforces relative resource requirements such as disallowing more than 4 CPUs or 48GB of memory per GPU. The QOS itself has a MaxTRESPerJob of cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to pr

[slurm-users] Re: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison The error message indicates that there are no resources to execute jobs. Since you haven’t defined any compute nodes, you will get this error. I would suggest that you create at least one compute node. Once you do that, this error should go away. Jeff From: Alison Peterson via slurm-
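A minimal slurm.conf fragment for a one-node test setup (names and sizes are placeholders; CPUs and RealMemory should match what "slurmd -C" reports on the node):

    NodeName=compute01 CPUs=4 RealMemory=7800 State=UNKNOWN
    PartitionName=debug Nodes=compute01 Default=YES MaxTime=INFINITE State=UP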

[slurm-users] Re: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison Can you provide the output of the following commands: * sinfo * scontrol show node name=head and the job command that your trying to run? From: Alison Peterson Sent: Tuesday, April 9, 2024 3:03 PM To: Jeffrey R. Lang Cc: slurm-users@lists.schedmd.com Subject: Re: [EXT] RE

[slurm-users] Re: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
. I need to see what’s in the test.sh file to get an idea of how your job is setup. jeff From: Alison Peterson Sent: Tuesday, April 9, 2024 3:15 PM To: Jeffrey R. Lang Cc: slurm-users@lists.schedmd.com Subject: Re: [EXT] RE: [EXT] RE: [slurm-users] Nodes required for job are down, drained or

[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
use scontrol update node=head state=resume and then check the status again. Hopefully the node will show idle, meaning that it should be ready to accept jobs. Jeff From: Alison Peterson Sent: Tuesday, April 9, 2024 3:40 PM To: Jeffrey R. Lang Cc: slurm-users

[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison I’m glad I was able to help. Good luck. Jeff From: Alison Peterson Sent: Tuesday, April 9, 2024 4:09 PM To: Jeffrey R. Lang Cc: slurm-users@lists.schedmd.com Subject: Re: [EXT] RE: [EXT] RE: [EXT] RE: [EXT] RE: [slurm-users] Nodes required for job are down, drained or reserved

[slurm-users] Optimizing CPU socket affinities and NVLink

2024-08-08 Thread Matthew R. Baney via slurm-users
Hello, I've recently adopted setting AutoDetect=nvml in our GPU nodes' gres.conf files to automatically populate Cores and Links for GPUs, which has been working well. I'm now wondering if I can prioritize having single GPU jobs scheduled on NVLink pairs (these are PCIe A6000s) where one of the G

[slurm-users] Node configuration unavailable when using --mem-per-gpu , for specific GPU type

2024-12-13 Thread Matthew R. Baney via slurm-users
Hi all, I'm seeing some odd behavior when using the --mem-per-gpu flag instead of the --mem flag to request memory when also requesting all available CPUs on a node (in this example, all available nodes have 32 CPUs): $ srun --ntasks-per-node=8 --cpus-per-task=4 --gpus-per-node=gtx1080ti:1 --mem-

[slurm-users] Re: Slurm not running on a warewulf node

2024-12-03 Thread Jeffrey R. Lang via slurm-users
Steve Try running the failing process from the command line and use the -D option. Per man page: Run slurmd in the foreground. Error and debug messages will be copied to stderr. Jeffrey R. Lang Advanced Research Computing Center University of Wyoming, Information Technology Center 1000 E
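That is, something along the lines of (run as root on the failing node; the extra -v flags just raise verbosity):

    slurmd -D -vvv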