[slurm-users] Enforcing -c and -t for fairshare scheduling and other settings

2022-05-13 Thread r
Hi,

We've deployed a Slurm cluster and it works well. However, I would like to
encourage users to conserve resources and to distribute jobs more fairly.

Below are some ideas I'd like to implement. Please let me know whether they are
feasible and, if so, point me in the right direction, or let me know if
there are better ways of achieving this goal.

I would like to:
- Require users to specify the -c and -t options; that is, reject any job
that does not specify them. Optionally also --mem, but that is of lower
priority to us.
- Forbid the use of --cpu-bind=no, or treat it as -c 64.
- Set up a fairshare scheduler and assign weight to the values specified via -c
and -t (a rough config sketch follows after this list).
- Enforce the resource limits specified via -c, -t and --mem (-t and -c already
work, at least without --cpu-bind=no).
- Either limit the overall number of CPU slots per partition, or test for
availability of licences before jobs are released from the queue. This is
to prevent jobs from waiting for licenses at run time and potentially getting
killed when the -t timeout is exceeded.
- Ideally, force jobs to queue for a certain period of time (a small
fraction of -c * -t) even if the partition has resources available. This
is to prevent large jobs from being submitted and dispatched ahead of
smaller jobs, and to further reward conserving resources.
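
For the fairshare/TRES-weight items, a rough slurm.conf sketch (the values are placeholders, not recommendations) could be a starting point; note that outright rejecting jobs that omit -c or -t cannot be done in slurm.conf alone and would normally live in a job_submit plugin (e.g. JobSubmitPlugins=lua):

PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightTRES=CPU=2000,Mem=1000
AccountingStorageEnforce=limits,safe   # needed before per-association limits are enforced
DefMemPerCPU=2000                      # a sane default so --mem can stay optional

As far as I know the deliberate queueing delay in the last item is not a stock Slurm feature; the closest standard levers are the fairshare and job-size priority weights.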

Many thanks,
-R


Re: [slurm-users] slurm-users Digest, Vol 67, Issue 20

2023-05-17 Thread Sridhar R
Can you please remove my email id from your mailing list? I don't want
these emails anymore. Thanks.

On Wed, May 17, 2023 at 11:42 PM 
wrote:

> Message: 1
> Date: Wed, 17 May 2023 18:11:49 +
> From: "Renfro, Michael" 
> To: Slurm User Community List 
> Subject: Re: [slurm-users] On the ability of coordinators
> Message-ID: <33d4bf81-8025-4f08-80da-83b578175...@tntech.edu>
> Content-Type: text/plain; charset="utf-8"
>
> If there's a fairshare component to job priorities, and there's a share
> assigned to each user under the account, wouldn't the light user's jobs
> move ahead of any of the heavy user's pending jobs automatically?
>
> From: slurm-users  on behalf of
> "Groner, Rob" 
> Reply-To: Slurm User Community List 
> Date: Wednesday, May 17, 2023 at 1:09 PM
> To: "slurm-users@lists.schedmd.com" 
> Subject: Re: [slurm-users] On the ability of coordinators
>
>
>
> 
> Ya, I found they had the power to hold jobs just by experimentation.
> Maybe it will turn out I had something misconfigured and coordinators don't
> have that ability either.  I hope that's not the case, since being able to
> hold jobs in their account gives them some usefulness.
>
> My interest in this was solely focused on what coordinators could do to
> jobs within their account.  So, I accepted as ok that a coordinator
> couldn't move jobs in their account to a higher priority than jobs in other
> accounts.  I just wanted the coordinator to be able to move jobs in their
> account to a higher priority over other jobs within the same account.
> Being able to use hold/release seems like what we're looking for.  I just
> wonder why coordinators can't use "top" as well, for jobs within their
> coordinated account.  I guess "top" is meant to move them to the top of the
> entire pending queue, and in my case, I was only interested in the
> coordinator moving certain jobs in their accounts to the top of the
> account-related queue.  But of course, there ISN'T an account-related
> queue, so maybe that's why top doesn't work for a coordinator.  I think I
> just answered my own question.
>
> 
> From: slurm-users  on behalf of
> Brian Andrus 
> Sent: Wednesday, May 17, 2023 2:00 PM
> To: slurm-users@lists.schedmd.com 
> Subject: Re: [slurm-users] On the ability of coordinators
>
>
> Coordinator permissions from the man pages:
>
> coordinator
> A special privileged user, usually an account manager, that can add users
> or sub-accounts to the account they are coordinator over. This should be a
> trusted person since they can change limits on account and user
> associations, as well as cancel, requeue or reassign accounts of jobs
> inside their realm.
>
> So, I read that as it manages accounts in slurmdb with minimal access to
> the jobs themselves. So you would be stuck with cancel/requeue. I see no
> mention of hold, but if that is one of the permissions, I would say, yes,
> our approach does what you want within the limits of what the default
> permissions of a coordinator can do.
>
>
>
> Of course, that still may not work if there are other
> accounts/partitions/users with higher priority jobs than User B.
> Specifically if those jobs can use the same resources A's jobs are running
> on.
>
>
>
> Brian Andrus
>
>
> On 5/17/2023 10:49 AM, Groner, Rob wrote:
> I'm not sure what you mean by "if they have the permissions".  I'm talking
> about someone who is specifically designated as "coordinator" of an account
> in slurm.  With that designation, and no other admin level changes, I'm not
> aware that they can directly change the priority of jobs associated with
> the account.
>
> If you're talking about additional permissions or admin levels...we're not
> looking into that as an option.  We want to purely use the coordinator role
> to have them manipulate stuff.
>
> 
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Brian Andrus <toomuc...@gmail.com>
> Sent: Wednesday, May 17, 2023 12:58 PM
> To: slurm-users@lists.schedmd.com

Re: [slurm-users] x11 forwarding not available?

2018-10-15 Thread R. Paul Wiegand
I believe you also need:

X11UseLocalhost no



> On Oct 15, 2018, at 7:07 PM, Dave Botsch  wrote:
> 
> Hi.
> 
> X11 forwarding is enabled and works for normal ssh.
> 
> Thanks.
> 
> On Mon, Oct 15, 2018 at 09:55:59PM +, Rhian Resnick wrote:
>> 
>> 
>> Double check that /etc/ssh/sshd_config allows X11 forwarding on the node, as it is 
>> disabled by default. (I think)
>> 
>> 
>> X11Forwarding yes
>> 
>> 
>> 
>> 
>> Rhian Resnick
>> 
>> Associate Director Research Computing
>> 
>> Enterprise Systems
>> 
>> Office of Information Technology
>> 
>> 
>> Florida Atlantic University
>> 
>> 777 Glades Road, CM22, Rm 173B
>> 
>> Boca Raton, FL 33431
>> 
>> Phone 561.297.2647
>> 
>> Fax 561.297.0222
>> 
>> 
>> 
>> 
>> From: slurm-users  on behalf of Dave 
>> Botsch 
>> Sent: Monday, October 15, 2018 5:51 PM
>> To: slurm-users@lists.schedmd.com
>> Subject: [slurm-users] x11 forwarding not available?
>> 
>> 
>> 
>> Wanted to test X11 forwarding. X11 forwarding works as a normal user
>> just ssh'ing to a node and running xterm/etc.
>> 
>> With srun, however:
>> 
>> srun -n1 --pty --x11 xterm
>> srun: error: Unable to allocate resources: X11 forwarding not available
>> 
>> So, what am I missing?
>> 
>> Thanks.
>> 
>> PS
>> 
>> srun --version
>> slurm 17.11.7
>> 
>> rpm -qa |grep slurm
>> ohpc-slurm-server-1.3.5-8.1.x86_64
>> ...
>> 
>> 
>> --
>> 
>> David William Botsch
>> Programmer/Analyst
>> @CNFComputing
>> bot...@cnf.cornell.edu
>> 
>> 
> 
> -- 
> 
> David William Botsch
> Programmer/Analyst
> @CNFComputing
> bot...@cnf.cornell.edu
> 




Re: [slurm-users] Reserving a GPU

2018-10-22 Thread R. Paul Wiegand
I had the same question and put in a support ticket.  I believe the answer
is that you cannot.

On Mon, Oct 22, 2018, 11:51 Christopher Benjamin Coffey <
chris.cof...@nau.edu> wrote:

> Hi,
>
> I can't figure out how one would create a reservation to reserve a gres
> unit, such as a gpu. The man page doesn't really say that gres is supported
> for a reservation, but it does say tres is supported. Yet, I can't seem to
> figure out how one could specify a gpu with tres. I've tried:
>
>
> scontrol create reservation starttime=2018-11-10T08:00:00 user=root
> duration=14-00:00:00 tres=gpu/tesla=1
> scontrol create reservation starttime=2018-11-10T08:00:00 user=root
> duration=14-00:00:00 gres=gpu:tesla:1
>
> Is it not possible to reserve a single gpu? Our gpus are requested
> normally in jobs like " --gres=gpu:tesla:1". I'd rather not reserve an
> entire node and thus reserve all of the gpus in it. Thanks!
>
> Best,
> Chris
>
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>
>
>


Re: [slurm-users] how to find out why a job won't run?

2018-11-26 Thread R. Paul Wiegand
Steve,

This doesn't really address your question, and I am guessing you are
aware of this; however, since you did not mention it: "scontrol show
job <jobid>" will give you a lot of detail about a job (a lot more
than squeue).  Its "Reason" is the same as in sinfo and squeue, though,
so no help there.  I've always found that it is a bit of a detective
exercise.  In the end, though, there's always a reason.  It's just
sometimes very subtle.  For example, we use "Features" so that users
can constrain their jobs based on various factors (e.g., CPU
architecture), and we'll sometimes have users ask for something like a
"Haswell" processor and 190 GB of memory ... but we only have that
much on our Skylake machines.  So the "reason" can be very non-linear.
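
For the detective work itself, the usual sequence looks something like this (the job ID is a placeholder):

scontrol show job 12345        # full request: CPUs, memory, features, licenses, Reason
squeue -j 12345 --start        # expected start time plus reason, if backfill has computed one
sprio -j 12345                 # per-factor priority breakdown (multifactor priority only)
sinfo -N -o "%N %c %m %f %T"   # per-node CPUs, memory, features and state to compare against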

Sadly, I don't know of an easy tool that just looks at all the data
and tells you or gives you better clues.  I agree that that would be
very helpful.


As to preemptable, do you have "checkpoint" enabled via SLURM?  There
are situations in which a SLURM-checkpointed job will still occupy
some memory, and a pending job cannot deploy because that memory is in
use, even though the job was suspended.  Perhaps someone on the list
with more experience using the preemptable partitions/QoS *WITH* the
SLURM checkpointing flag enabled could speak to this?  As Steve knows,
we just cancel the job when it is preempted.

Paul.

On Mon, Nov 26, 2018 at 3:22 AM Daan van Rossum  wrote:
>
> I'm also interested in this.  Another example: "Reason=(ReqNodeNotAvail)" is 
> all that a user sees in a situation when his/her job's walltime runs into a 
> system maintenance reservation.
>
> * on Friday, 2018-11-23 09:55 -0500, Steven Dick  wrote:
>
> > I'm looking for a tool that will tell me why a specific job in the
> > queue is still waiting to run.  squeue doesn't give enough detail.  If
> > the job is held up on QOS, it's pretty obvious.  But if it's
> > resources, it's difficult to tell.
> >
> > If a job is not running because of resources, how can I identify which
> > resource is not available?  In a few cases, I've looked at what the
> > job asked for and found a node that has those resources free, but
> > still can't figure out why it isn't running.
> >
> > Also, if there are preemptable jobs in the queue, why is the job
> > waiting on resources?  Is there a priority for running jobs that can
> > be compared to waiting jobs?



[slurm-users] Having a possible cgroup issue?

2018-12-06 Thread Anderson, Wes R
I took a look through the archives and did not see a clear answer to the 
issue I was seeing, so I thought I would go ahead and ask.

I am having a cluster issue with Slurm and hoped you might be able to help me 
out. I built a small test cluster to determine whether it might meet some compute 
needs I have, but I keep running into an issue where Slurm restricts 
MATLAB to a single CPU regardless of how many we request.

During testing I found the following:

When I login into a MATLAB interactive session and run "feature numcores"

I get the following:

[screenshot omitted: "feature numcores" reports 14 cores]

Which is correct, as I have 14 cores and they are all available.


However when I go into SLURM and request a MATLAB interactive session and run 
the same command on the same computer:

[screenshot omitted: inside the Slurm session the same command reports only 1 core]


So, what I understand is that my cgroups settings in SLURM are restricting 
MATLAB to a single core. Is that correct?  Also, how do I fix this?
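
One thing worth ruling out before changing cgroup.conf: with ConstrainCores=yes a job is confined to exactly the cores it requested, so an interactive session asked for with the default of one CPU will legitimately be pinned to one core. A quick sanity check (core count and options are only an example) would be:

srun --cpus-per-task=14 --pty bash
nproc                                      # CPUs visible inside the allocation
grep Cpus_allowed_list /proc/self/status   # the actual affinity/cgroup mask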

Here is my cgroups.conf

###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--

####
# W A R N I N G:  This file is managed by Puppet   #
# - - - - - - -   changes are likely to be overwritten #


###
CgroupAutomount=yes
###

# testing -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
#ConstrainCores=no
#ConstrainRAMSpace=no
#ConstrainSwapSpace=no
# testing -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

#
ConstrainDevices=no

AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
##MinRAMSpace=30

# TASK/CGROUP PLUGIN

# Constrain the job cgroup RAM to this percentage of the allocated memory.
#AllowedRAMSpace=10
AllowedRamSpace=100

# TaskAffinity=
#  If configured to "yes" then set a default task affinity to bind each
#  step task to a subset of the allocated cores using
#  sched_setaffinity. The default value is "no". Note: This feature
#  requires the Portable Hardware Locality (hwloc) library to be
#  installed.
TaskAffinity=yes

# MemorySwappiness=
# Configure the kernel's priority for swapping out anonymous pages (such as
# program data) versus file cache pages for the job cgroup. Valid values are
# between 0 and 100, inclusive. A value of 0 prevents the kernel from swapping
# out program data. A value of 100 gives equal priority to swapping out file
# cache or anonymous pages. If not set, the kernel's default swappiness value
# will be used. Either ConstrainRAMSpace or ConstrainSwapSpace must be set to
# yes in order for this parameter to be applied.

MemorySwappiness=0

#
# If compute nodes mount Lustre or NFS file systems, it may be a good idea to
# configure cgroup.conf with:
#   ConstrainKmemSpace=no
# From <https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#activating-cgroups>
#
ConstrainKmemSpace=no

####
# W A R N I N G:  This file is managed by Puppet   #
# - - - - - - -   changes are likely to be overwritten #


Thanks,
Wes
(A slurm neophyte)


[slurm-users] Simple question but I can't find the answer

2019-01-10 Thread Jeffrey R. Lang
Guys

When I run sinfo, some of the nodes in the list show their hostname with a 
trailing asterisk.  I've looked through the man pages and what I can find on 
the web, but nothing provides an answer.

So what does the asterisk after the hostname mean?


Jeff


Re: [slurm-users] Simple question but I can't find the answer

2019-01-10 Thread Jeffrey R. Lang
Yes, I missed the mark here; it is indeed after the partition name.


From: slurm-users  On Behalf Of Andy 
Riebs
Sent: Thursday, January 10, 2019 10:22 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Simple question but I can't find the answer


Is it following a host name, or a partition name? If the latter, it just means 
that it's the default partition.

From: Jeffrey R. Lang <mailto:jrl...@uwyo.edu>
Sent: Thursday, January 10, 2019 11:13AM
To: Slurm-users <mailto:slurm-us...@schedmd.com>
Cc:
Subject: [slurm-users] Simple question but I can't find the answer
Guys

When I run sinfo some of the nodes in the list show there hostname with a 
following asterisk.  I’ve looked through the man pages and what I can find on 
the web but nothing provides an answer.

So what does the asterisk after the hostname mean?


Jeff



[slurm-users] Why is this command not working

2019-01-16 Thread Jeffrey R. Lang
I'm trying to set a maxjobs limit on a specific user in my cluster, but 
following the example in the sacctmgr man page I keep getting this error.


sacctmgr -v modify user where name=jrlang cluster=teton account=microbiome set 
maxjobs=30
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
Nothing modified

I then check the results of the change with:

sacctmgr show assoc where cluster=teton  account=microbiome tree

   Cluster  Account   User  Partition Share GrpJobs   
GrpTRES GrpSubmit GrpWall   GrpTRESMins MaxJobs   MaxTRES 
MaxTRESPerNode MaxSubmit MaxWall   MaxTRESMins  QOS   Def 
QOS GrpTRESRunMin
--  -- -- - --- 
- - --- - --- - 
-- - --- -  
- -
 teton microbiome 1 

  microbiome,normal microbio+
 teton  microbiomejrlang1

But the max job setting has not been modified.

Can someone point out what I'm doing wrong?



Re: [slurm-users] Nodes remaining in drain state once job completes

2019-03-18 Thread Pawel R. Dziekonski
On 18/03/2019 23.07, Eric Rosenberg wrote:
>  [2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task 
> failed

This usually happens for me when one of the shared filesystems
is overloaded and processes are stuck in uninterruptible sleep
(D), thus unable to terminate.

Your reason can be different.

HTH, P

-- 
Dr. Pawel Dziekonski 
KAUST Advanced Computing Core Laboratory
https://www.hpc.kaust.edu.sa




Re: [slurm-users] Increasing job priority based on resources requested.

2019-04-21 Thread Pawel R. Dziekonski
Hi,

you can always come up with some kind of submit "filter" that would
assign constraints to jobs based on requested memory. In this way you
can force smaller-memory jobs to go only to low-memory nodes and keep
the large-memory nodes free from trash jobs.

The disadvantage is that the large-mem nodes would sit idle if only low-mem
jobs are in the queue.
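
A rough sketch of that idea (node names, memory sizes and feature names are invented for the example):

# slurm.conf: tag nodes with a feature describing their memory class
NodeName=node[001-064] RealMemory=192000  Feature=stdmem ...
NodeName=himem[01-04]  RealMemory=1536000 Feature=himem ...

# what the submit filter would effectively add to a small-memory job
sbatch --constraint=stdmem --mem-per-cpu=4000 job.sh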

cheers,
P







>> - Original Message -
>> From: "Prentice Bisbal" 
>> To: slurm-users@lists.schedmd.com
>> Sent: Friday, April 19, 2019 11:27:08 AM
>> Subject: Re: [slurm-users] Increasing job priority based on resources 
>> requested.
>>
>> Ryan,
>>
>> I certainly understand your point of view, but yes, this is definitely
>> what I want. We only have a few large memory nodes, so we want jobs that
>> request a lot of memory to have higher priority so they get assigned to
>> those large memory nodes ahead of lower-memory jobs which could run
>> anywhere else. But we don't want those nodes to sit idle if there's jobs
>> in the queue that need that much memory. Similar idea for IB - nodes
>> that need IB should get priority over nodes that don't
>>
>> Ideally, I wouldn't have such a heterogeneous environment, and then this
>> wouldn't be needed at all.
>>
>> I agree this opens another avenue for unscrupulous users to game the
>> system, but that (in theory) can be policed by looked at memory
>> requested vs. memory used in the accounting data to identify any abusers
>> and then give them a stern talking to.
>>
>> Prentice
>>
>>
>> On 4/18/19 5:27 PM, Ryan Novosielski wrote:
>>> This is not an official answer really, but I’ve always just considered this 
>>> to be the way that the scheduler works. It wants to get work completed, so 
>>> it will have a bias toward doing what is possible vs. not (can’t use 239GB 
>>> of RAM on a 128GB node). And really, is a higher priority what you want? 
>>> I’m not so sure. How soon will someone figure out that they might get a 
>>> higher priority based on requesting some feature they don’t need?
>>>
>>> --
>>> 
>>> || \\UTGERS, 
>>> |---*O*---
>>> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>> ||  \\of NJ  | Office of Advanced Research Computing - MSB C630, 
>>> Newark
>>> `'
>>>
 On Apr 18, 2019, at 5:20 PM, Prentice Bisbal  wrote:

 Slurm-users,

 Is there away to increase a jobs priority based on the resources or 
 constraints it has requested?

 For example, we have a very heterogeneous cluster here: Some nodes only 
 have 1 Gb Ethernet, some have 10 Gb Ethernet, and others have DDR IB. In 
 addition, we have some large memory nodes with RAM amounts ranging from 
 128 GB up to 512 GB. To allow a user to request IB, I have implemented 
 that as a feature in the node definition so users can request that as a 
 constraint.

 I would like to make it that if a job request IB, it's priority will go 
 up, or if it requests a lot of memory (specifically memory-per-cpu), it's 
 priority will go up proportionately to the amount of memory requested. Is 
 this possible? If so, how?

 I have tried going through the documentation, and googling, but 'priority' 
 is used to discuss job priority so much, I couldn't find any search 
 results relevant to this.

 -- 
 Prentice


> 
> 

-- 
Dr. Pawel Dziekonski 
KAUST Advanced Computing Core Laboratory
https://www.hpc.kaust.edu.sa



[slurm-users] scontrol for a heterogenous job appears incorrect

2019-04-23 Thread Jeffrey R. Lang
I'm testing heterogeneous jobs for a user on our cluster, but I am seeing what I 
think is incorrect output from "scontrol show job XXX" for the job. The cluster is 
currently using Slurm 18.08.

So my job script looks like this:

#!/bin/sh

### This is a general SLURM script. You'll need to make modifications for this 
to
### work with the appropriate packages that you want. Remember that the .bashrc
### file will get executed on each node upon login and any settings in this 
script
### will be in addition to, or will override, the system bashrc file settings. 
Users will
### find it advantageous to use only the specific modules they want or
### specify a certain PATH environment variable, etc. If you have questions,
### please contact the ARCC at arcc-i...@uwyo.edu for help.

### Informational text is usually indicated by "###". Don't uncomment these 
lines.

### Lines beginning with "#SBATCH" are SLURM directives. They tell SLURM what 
to do.
### For example, #SBATCH --job-name my_job tells SLURM that the name of the job 
is "my_job".
### Don't remove the "#SBATCH".

### Job Name
#SBATCH --job-name=CHECK_NODE

### Declare an account for the job to run under
#SBATCH --account=arcc

### Standard output stream files are have a default name of:
### "slurm_.out" However, this can be changed using options
### below. If you would like stdout and stderr to be combined,
### omit the "SBATCH -e" option below.
###SBATCH -o stdout_file
###SBATCH -e stderr_file

### mailing options
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=xxx

### Set max walltime (days-hours:minutes:seconds)
#SBATCH --time=0-01:00:00

### Specify Resources
### 2 nodes, 16 processors (cores) each node
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=1  --partition=teton-hugemem
#SBATCH packjob
#SBATCH --nodes=9 --ntasks-per-node=32 --partition=teton

### Load needed modules
#module load gcc/7.3.0
#module load swset/2018.05
#module load openmpi/3.1.0

### Start the job via launcher.
### Command normally given on command line
srun check_nodes

sleep 600


When I submit the job and check it with "scontrol show job XXX"

JobId=2607083 PackJobId=2607082 PackJobOffset=1 JobName=CHECK_NODE
   PackJobIdSet=2607082-2607083
   UserId=jrlang(10024903) GroupId=jrlang(10024903) MCS_label=N/A
  Priority=1086 Nice=0 Account=arcc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:33 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-04-23T15:42:45 EligibleTime=2019-04-23T15:42:45
   AccrueTime=2019-04-23T15:42:45
   StartTime=2019-04-23T15:42:49 EndTime=2019-04-23T16:42:49 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-23T15:42:49
   Partition=teton AllocNode:Sid=tmgt1:34097
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=t[456-464]
   BatchHost=t456
   NumNodes=9 NumCPUs=288 NumTasks=288 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=288,mem=288000M,node=9,billing=288
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/tsfs1/home/jrlang/TEST_CODE/check_nodes.sbatch
   WorkDir=/pfs/tsfs1/home/jrlang/TEST_CODE
   StdErr=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2607083.out
   StdIn=/dev/null
   StdOut=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2607083.out
   Power=

Looking at the nodelist and the NumNodes they are both incorrect.   They should 
show the first node and then the additional nodes assigned.

Using pestat I see the 10 nodes allocated to the job.

t456   tetonalloc  32  320.01*   128000   119539  2607083 
jrlang
t457   tetonalloc  32  320.01*   128000   119459  2607083 
jrlang
t458   tetonalloc  32  320.03*   128000   119854  2607083 
jrlang
t459   tetonalloc  32  320.01*   128000   119567  2607083 
jrlang
t460   tetonalloc  32  320.01*   128000   119567  2607083 
jrlang
t461   tetonalloc  32  320.01*   128000   119308  2607083 
jrlang
t462   tetonalloc  32  320.01*   128000   119570  2607083 
jrlang
t463   tetonalloc  32  320.01*   128000   119241  2607083 
jrlang
t464   tetonalloc  32  320.01*   128000   119329  2607083 
jrlang
   thm03   teton-hugemem  mix   1  320.01*  1024000  1017834  2607082 
jrlang

So why is scontrol not showing the thm03 node in the nodelist and including it 
in the Numnodes?

One other question is how Slurm treats the job output for this job. The job 
is a "hello world" type which just outputs the node and rack that each 
part runs on.  When the job completes I only see one line in the output, from 
the rank 0 task.

So where is all the rank output ending up?

Jeff







Re: [slurm-users] scontrol for a heterogenous job appears incorrect

2019-04-24 Thread Jeffrey R. Lang
Chris

Upon further testing this morning I see the job is assigned two different 
job IDs, something I wasn't expecting.  This led me down the road of thinking 
the output was incorrect.

scontrol on a heterogeneous job will show multiple job IDs for the job, so the output 
just wasn't what I was expecting.

Jeff

[jrlang@tlog1 TEST_CODE]$ sbatch check_nodes.sbatch
Submitted batch job 2611773
 [jrlang@tlog1 TEST_CODE]$ squeue | grep jrlang
 2611773+1 teton CHECK_NO   jrlang  R   0:10  9 t[439-447]
 2611773+0 teton-hug CHECK_NO   jrlang  R   0:10  1 thm03
[jrlang@tlog1 TEST_CODE]$ pestat | grep jrlang
t439   tetonalloc  32  320.02*   128000   119594  2611774 
jrlang  
t440   tetonalloc  32  320.02*   128000   119542  2611774 
jrlang  
t441   tetonalloc  32  320.01*   128000   119760  2611774 
jrlang  
t442   tetonalloc  32  320.01*   128000   121491  2611774 
jrlang  
t443   tetonalloc  32  320.02*   128000   119893  2611774 
jrlang  
t444   tetonalloc  32  320.02*   128000   119607  2611774 
jrlang  
t445   tetonalloc  32  320.03*   128000   119626  2611774 
jrlang  
t446   tetonalloc  32  320.01*   128000   119882  2611774 
jrlang  
t447   tetonalloc  32  320.01*   128000   120037  2611774 
jrlang  
   thm03   teton-hugemem  mix   1  320.01*  1024000  1017845  2611773 
jrlang  
[jrlang@tlog1 TEST_CODE]$ scontrol show job 2611773
JobId=2611773 PackJobId=2611773 PackJobOffset=0 JobName=CHECK_NODE
   PackJobIdSet=2611773-2611774
   UserId=jrlang(10024903) GroupId=jrlang(10024903) MCS_label=N/A
   Priority=1004 Nice=0 Account=arcc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:59 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-04-24T09:03:00 EligibleTime=2019-04-24T09:03:00
   AccrueTime=2019-04-24T09:03:00
   StartTime=2019-04-24T09:03:20 EndTime=2019-04-24T10:03:20 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-24T09:03:20
   Partition=teton-hugemem AllocNode:Sid=tlog1:24498
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=thm03
   BatchHost=thm03
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/tsfs1/home/jrlang/TEST_CODE/check_nodes.sbatch
   WorkDir=/pfs/tsfs1/home/jrlang/TEST_CODE
   StdErr=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611773.out
   StdIn=/dev/null
   StdOut=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611773.out
   Power=

JobId=2611774 PackJobId=2611773 PackJobOffset=1 JobName=CHECK_NODE
   PackJobIdSet=2611773-2611774
   UserId=jrlang(10024903) GroupId=jrlang(10024903) MCS_label=N/A
   Priority=1086 Nice=0 Account=arcc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:59 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-04-24T09:03:00 EligibleTime=2019-04-24T09:03:00
   AccrueTime=2019-04-24T09:03:00
   StartTime=2019-04-24T09:03:20 EndTime=2019-04-24T10:03:20 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-24T09:03:20
   Partition=teton AllocNode:Sid=tlog1:24498
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=t[439-447]
   BatchHost=t439
   NumNodes=9 NumCPUs=288 NumTasks=288 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=288,mem=288000M,node=9,billing=288
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/tsfs1/home/jrlang/TEST_CODE/check_nodes.sbatch
   WorkDir=/pfs/tsfs1/home/jrlang/TEST_CODE
   StdErr=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611774.out
   StdIn=/dev/null
   StdOut=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611774.out
   Power=


-Original Message-
From: slurm-users  On Behalf Of Chris 
Samuel
Sent: Tuesday, April 23, 2019 7:39 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] scontrol for a heterogenous job appears incorrect



On 23/4/19 3:02 pm, Jeffrey R. Lang wrote:

> Looking at the nodelist and the NumNodes they are both incorrect.   They
> should show the first node and then the additional nodes assigned.

You're only looking at the second of the two pack jobs for your submission, 
could they be assigned in the first one of the pack jobs instead?

> All the best,

Re: [slurm-users] Slurm database failure messages

2019-05-07 Thread Pawel R. Dziekonski
On 07/05/2019 13.47, David Baker wrote:

> We are experiencing quite a number of database failures. 

> [root@blue51 slurm]# less slurmdbd.log-20190506.gz | grep failed
> [2019-05-05T04:00:05.603] error: mysql_query failed: 1213 Deadlock found when 
> trying to get lock; try restarting transaction

Open ticket to SchedMD immediately, if you have a support contract!

We recently had a possibly similar case:
https://bugs.schedmd.com/show_bug.cgi?id=6922
It was about a reservation that ended 2 years ago ... :p

Are you sure that there are no more messages that would indicate
source of the problem in your case?

P



-- 
Dr. Pawel Dziekonski 
KAUST Advanced Computing Core Laboratory
https://www.hpc.kaust.edu.sa



[slurm-users] question about partition definition

2019-12-09 Thread Jeffrey R. Lang
I need to set up a partition that limits the number of jobs allowed to run at 
one time.   Looking at the slurm.conf page for partition definitions I don't 
see a MaxJobs option.

Is there a way to limit the number of jobs in a partition?
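
There is no MaxJobs option on the partition line itself; one common workaround (a sketch only, adjust names and numbers) is to put the limit on a QOS and attach that QOS to the partition:

sacctmgr add qos part_capped
sacctmgr modify qos part_capped set GrpJobs=50

# slurm.conf
PartitionName=capped Nodes=node[01-10] QOS=part_capped State=UP

I believe AccountingStorageEnforce must include limits (and usually qos) for the cap to actually be enforced.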

Thanks, Jeff


[slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-02-20 Thread Nathan R Crawford
Hi All,

  I have 19.05.4 and want to change SelectType from select/cons_res to
select/cons_tres without losing running or pending jobs. The documentation
is a bit conflicting.

From the man page:
SelectType
  Identifies the type of resource selection algorithm to be used. Changing
this value can only be done by restarting the slurmctld daemon and will
result in the loss of all job information (running and pending) since the
job state save format used by each plugin is different.

From slurm.schedmd.com/SLUG19/Slurm_19.05.pdf, slide 6:
● Can revert to cons_res without losing the queue
  ○ Although jobs using new cons_tres options cannot run
  ○ Both share a common state format to make this possible
■ Unlike cons_tres ⇎ serial which will drop the queue

  I interpret this as, in general, changing SelectType will nuke existing
jobs, but that since cons_tres uses the same state format as cons_res, it
should work.
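
For reference, the mechanics of the switch are just a config edit plus restarts; a hedged outline (assuming slurm.conf is shared or synchronized across all nodes) is:

# slurm.conf everywhere
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory   # keep whatever parameters you already use

systemctl restart slurmctld           # on the controller
# then restart slurmd on every compute node (pdsh/clush or similar)

Whether the running and pending queue survives that restart is exactly the question below.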

  Has anyone done this on a running system?

Thanks,
Nate


-- 

Dr. Nathan Crawford  nathan.crawf...@uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA


Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-02-21 Thread Nathan R Crawford
Hi Chris,

  If it just requires restarting slurmctld and the slurmd processes on the
nodes, I will be happy! Can you confirm that no running or pending jobs
were lost in the transition?

Thanks,
Nate

On Thu, Feb 20, 2020 at 6:54 PM Chris Samuel  wrote:

> On 20/2/20 2:16 pm, Nathan R Crawford wrote:
>
> >I interpret this as, in general, changing SelectType will nuke
> > existing jobs, but that since cons_tres uses the same state format as
> > cons_res, it should work.
>
> We got caught with just this on our GPU nodes (though it was fixed
> before I got to see what was going on) - it seems that the format of the
> RPCs changes when you go from cons_res to cons_tres and we were having
> issues until we restarted slurmd on the compute nodes as well.
>
> My memory is that this was causing issues for starting new jobs (in a
> failing completely type of manner), I'm not sure what the consequences
> were for running jobs (though I suspect it would not have been great for
> them).
>
> If Doug sees this he may remember this (he caught and fixed it).
>
> All the best,
> Chris
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>

-- 

Dr. Nathan Crawford  nathan.crawf...@uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA


[slurm-users] Question about determining pre-empted jobs

2020-02-28 Thread Jeffrey R. Lang
I need your help.

We have had a request to generate a report showing the number of jobs by date 
showing pre-empted jobs.   We used sacct to try to gather the data but we only 
found a few jobs with the state "PREEMPTED".

Scanning the slurmd logs we find there are a lot of job that show pre-empted.

What is the best way to gather or discover this data?
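
A hedged starting point (dates and fields are placeholders) is to ask sacct for the preempted state explicitly, keeping in mind that a job which was preempted and then requeued may be recorded with its final state rather than PREEMPTED:

sacct -a -S 2020-02-01 -E 2020-02-29 -X \
      --state=PREEMPTED \
      --format=JobID,User,Account,Partition,State,Start,End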

Thanks
Jeff


Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?

2020-03-29 Thread Nathan R Crawford
Sounds pretty safe, but with the current COVID-19 difficulties, including
Spring quarter classes being taught remotely (starts tomorrow, fun times
ahead), I'm a bit reluctant to poke a running system. This will get put on
the giant list of things waiting for a scheduled downtime once campus
re-opens.
Thanks,
Nate

On Thu, Mar 26, 2020 at 8:26 AM Steven Dick  wrote:

> When I changed this on a running system, no jobs were killed, but
> slurm lost track of jobs on nodes and was unable to kill them or tell
> when they were finished until slurmd on each node was restarted.  I
> let running jobs complete and monitored them manually, and restarted
> slurmd on each node as they finished.
>
> In desperation, you can do it, but it might be better to wait until no
> jobs (or few jobs) are running.
>
> On Thu, Mar 26, 2020 at 10:40 AM Pär Lindfors 
> wrote:
> >
> > Hi Nate,
> >
> > On Fri, 2020-02-21 at 11:38 -0800, Nathan R Crawford wrote:
> > >   If it just requires restarting slurmctld and the slurmd processes
> > > on the nodes, I will be happy! Can you confirm that no running or
> > > pending jobs were lost in the transition?
> >
> > Did you change your SelectType to cons_tres? How did it go?
> >
> > We need to do the same change on one of our clusters. I have done a few
> > tests on a tiny test cluster which so far indicates that changing works
> > even with jobs running, but a configuration change with even a small
> > risk of purging the job list makes me a little nervous.
> >
> > Regards,
> > Pär Lindfors,
> > UPPMAX
> >
> >
> >
> >
> >
> >
> >
> >
>
>

-- 

Dr. Nathan Crawford  nathan.crawf...@uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA


[slurm-users] SLURM 20.11.0 no x11 forwarding.

2021-04-22 Thread Luis R. Torres
Hi Folks,

I'm currently running a small but powerful 10 node cluster where we require
the scheduling of certain graphical apps.  Our SLURM version is 20.11.0,
half our nodes are RHEL7 and the other half Ubuntu 18.04.

Our slurm config related to x11 is:

PrologFlags=X11

X11Parameters=home_xauthority

We get the following error when attempting to use forwarding using SLURM


srun --nodelist=node01 --x11 xeyes

X11 connection rejected because of wrong authentication.

Error: Can't open display: localhost:78.0

srun: error: node01: task 0: Exited with exit code 1

The steps we take are:
ssh -X user@login-node
[user@login-node]$ srun --nodelist=node01 --x11 xeyes

No user accounts are allowed to SSH directly into execution nodes, only a
few special user accounts, however, those accounts have issues when using
--x11 but have NO issues when connecting directly (and forwarding) to the
execution nodes.

We went through this group to determine if anyone else has had similar
issues resolved but I didn't find anything other than some
recommendations.  Has someone actually resolved this particular (or very
similar) issue?

Our sshd_config is as follows:

X11Forwarding yes

X11DisplayOffset 10

X11UseLocalhost no


Our cluster is configured with SlurmUser=slurm, not root.


Thanks,
-- 
----
Luis R. Torres


Re: [slurm-users] SLURM 20.11.0 no x11 forwarding.

2021-04-23 Thread Luis R. Torres
I believe that was the case, we compiled it with x11 support, however,
further debugging suggests that there's an issue writing to the .Xauthority
file when using forwarding through srun.


[slurm-users] Exposing only requested CPUs to a job on a given node.

2021-05-14 Thread Luis R. Torres
Hi Folks,

We are currently running on SLURM 20.11.6 with cgroups constraints for
memory and CPU/Core.  Can the scheduler only expose the requested number of
CPU/Core resources to a job?  We have some users that employ python scripts
with the multiprocessing module, and the scripts apparently use all of
the CPUs/cores in a node, despite using options to constrain a task to just
a given number of CPUs.  We would like several multiprocessing jobs to
run simultaneously on the nodes without stepping on each other.

The sample script I use for testing is below; I'm looking for something
similar to what can be done with the GPU Gres configuration where only the
number of GPUs requested are exposed to the job requesting them.


#!/usr/bin/env python3

import multiprocessing


def worker():
    print("Worker on CPU #%s" % multiprocessing.current_process().name)
    result = 0
    for j in range(20):
        result += j**2
    print("Result on CPU {} is {}".format(multiprocessing.current_process().name, result))
    return


if __name__ == '__main__':
    jobs = []
    print("This host exposed {} CPUs".format(multiprocessing.cpu_count()))
    for i in range(multiprocessing.cpu_count()):
        p = multiprocessing.Process(target=worker, name=str(i))
        p.start()
        jobs.append(p)
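
One caveat with this test script: multiprocessing.cpu_count() is os.cpu_count(), which reports every CPU in the node and ignores the cgroup/affinity mask, so the "This host exposed ..." line will always claim the whole machine even when the binding is correct. len(os.sched_getaffinity(0)) reflects what the job can actually use; a quick check from inside an allocation, for example:

srun --ntasks=1 --cpus-per-task=4 \
     python3 -c "import os; print(os.cpu_count(), len(os.sched_getaffinity(0)))"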

Thanks,
-- 
----
Luis R. Torres


Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-07-01 Thread Luis R. Torres
Hi Folks,

Thank you for your responses. I wrote the following configuration in
cgroup.conf along with the appropriate slurm.conf
changes, and I wrote a program to verify affinity when queued or running in
the cluster.  Results are below.  Thanks so much.

###

#

# Slurm cgroup support configuration file

#

# See man slurm.conf and man cgroup.conf for further

# information on cgroup configuration parameters

#--

CgroupAutomount=yes

CgroupMountpoint=/sys/fs/cgroup

#ConstrainCores=no

ConstrainCores=yes

ConstrainRAMSpace=yes

ConstrainDevices=no

ConstrainKmemSpace=no #Avoid a known kernel issue

ConstrainSwapSpace=yes

TaskAffinity=no #Use task/affinity plugin instead
-

srun --tasks=1 --cpus-per-task=1 --partition=long show-affinity.py

pid 1122411's current affinity mask: 401


=

CPUs in system:  20

PID:  1122411

Allocated CPUs/Cores:  2

Affinity List:  {0, 10}

=

srun --tasks=1 --cpus-per-task=4 --partition=long show-affinity.py

pid 1122446's current affinity mask: c03


=

CPUs in system:  20

PID:  1122446

Allocated CPUs/Cores:  4

Affinity List:  {0, 1, 10, 11}

=


srun --tasks=1 --cpus-per-task=6 --partition=long show-affinity.py

pid 1122476's current affinity mask: 1c07


=

CPUs in system:  20

PID:  1122476

Allocated CPUs/Cores:  6

Affinity List:  {0, 1, 2, 10, 11, 12}

=

On Fri, May 14, 2021 at 1:35 PM Luis R. Torres  wrote:

> Hi Folks,
>
> We are currently running on SLURM 20.11.6 with cgroups constraints for
> memory and CPU/Core.  Can the scheduler only expose the requested number of
> CPU/Core resources to a job?  We have some users that employ python scripts
> with the multi processing modules, and the scripts apparently use all of
> the CPU/Cores in a node, despite using options to constraint a task to just
> a given number of CPUs.We would like several multiprocessing jobs to
> run simultaneously on the nodes, but not step on each other.
>
> The sample script I use for testing is below; I'm looking for something
> similar to what can be done with the GPU Gres configuration where only the
> number of GPUs requested are exposed to the job requesting them.
>
>
> #!/usr/bin/env python3
>
> import multiprocessing
>
>
> def worker():
>     print("Worker on CPU #%s" % multiprocessing.current_process().name)
>     result = 0
>     for j in range(20):
>         result += j**2
>     print("Result on CPU {} is {}".format(multiprocessing.current_process().name, result))
>     return
>
>
> if __name__ == '__main__':
>     jobs = []
>     print("This host exposed {} CPUs".format(multiprocessing.cpu_count()))
>     for i in range(multiprocessing.cpu_count()):
>         p = multiprocessing.Process(target=worker, name=str(i))
>         p.start()
>         jobs.append(p)
>
> Thanks,
> --
> 
> Luis R. Torres
>


-- 

Luis R. Torres


Re: [slurm-users] How to avoid a feature?

2021-07-02 Thread Jeffrey R. Lang
How about using node weights?  Weight the non-GPU nodes so that they are 
scheduled first.  The GPU nodes could have a very high weight so that the 
scheduler would consider them last for allocation. This would allow the non-GPU 
nodes to be filled first and, when they are full, schedule the GPU nodes.  Users needing a 
GPU could just include a feature request, which should allocate the GPU nodes as 
necessary.
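
A rough illustration (node names and weights are invented; Slurm allocates lower-weight nodes first):

# slurm.conf fragment
NodeName=cpu[001-064] Weight=1   ...
NodeName=gpu[01-08]   Weight=100 Gres=gpu:4 ...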

Jeff


-Original Message-
From: slurm-users  On Behalf Of Loris 
Bennett
Sent: Friday, July 2, 2021 12:48 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] How to avoid a feature?



Hi Tina,

Tina Friedrich  writes:

> Hi Brian,
>
> sometimes it would be nice if SLURM had what Grid Engine calls a 'forced
> complex' (i.e. a feature that you *have* to request to land on a node that has
> it), wouldn't it?
>
> I do something like that for all of my 'special' nodes (GPU, KNL, nodes...) - 
> I
> want to avoid jobs not requesting that resource or allowing that architecture
> landing on it. I 'tag' all nodes with a relevant feature (cpu, gpu, knl, ...),
> and have a LUA submit verifier that checks for a 'relevant' feature (or a
> --gres=gpu or somthing) and if there isn't one I add the 'cpu' feature to the
> request.
>
> Works for us!

We just have the GPU nodes in a separate partition 'gpu' which users
have to specify if they want a GPU.  How does that approach differ from
yours in terms of functionality for you (or the users)?

The main problem with our approach is that the CPUs on the GPU nodes can
remain idle while there is a queue for the regular CPU nodes.  What I
would like is to allow short CPU-only jobs to run on the GPUs but only
allow GPU-jobs to run for longer, which I guess I could probably do
within the submit plugin.

Cheers,

Loris


> Tina
>
> On 01/07/2021 15:08, Brian Andrus wrote:
>> All,
>>
>> I have a partition where one of the nodes has a node-locked license.
>> That license is not used by everyone that uses the partition.
>> They are cloud nodes, so weights do not work (there is an open bug about
>> that).
>>
>> I need to have jobs 'avoid' that node by default. I am thinking I can use a
>> feature constraint, but that seems to only apply to those that want the
>> feature. Since we have so many other users, it isn't feasible to have them
>> modify their scripts, so having it avoid by default would work.
>>
>> Any ideas how to do that? Submit LUA perhaps?
>>
>> Brian Andrus
>>
>>
--
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



[slurm-users] Assigning two "cores" when I only request one.

2021-07-12 Thread Luis R. Torres
Hi Folks,

I'm trying to run one task on one "core", however, when I test the
affinity, the system gives me "two"; I'm assuming the two are threads since
the system is a dual socket system.  Is there anything in the configuration
that I can change to have a single core or thread assigned to a
single-processing job by default?


srun --ntasks=1 --cpus-per-task=1 show-affinity.py

pid 7899's current affinity mask: 401


=

CPUs in system:  20

PID:  7899

Allocated CPUs/Cores:  2

Affinity List:  {0, 10}

=
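
If the goal is one hardware thread rather than one full core (i.e. both of its threads), something along these lines is worth testing; treat it as a sketch rather than a definitive answer:

srun --ntasks=1 --cpus-per-task=1 --hint=nomultithread show-affinity.py

# or, cluster-wide, schedule by thread instead of by core in slurm.conf:
# SelectTypeParameters=CR_CPU   (instead of CR_Core)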



-- 
----
Luis R. Torres


[slurm-users] big increase of MaxStepCount?

2022-01-12 Thread John R Anderson
hello,
 A user has requested that we set MaxStepCount to "unlimited" or 16 million to 
accommodate some of their desired workflows. I searched around for details 
about this parameter and don't see a lot, and I reviewed 
https://bugs.schedmd.com/show_bug.cgi?id=5722

Any thoughts on this? Can it successfully be applied to a partition or to 
individual nodes only? I wonder about log files exploding, or worse...

thanks!



John R. Anderson
High-Performance Computing Engineer
Office of Information Technology
University of Nevada, Reno

email: j...@unr.edu<mailto:j...@unr.edu>




Re: [slurm-users] Fwd: useradd: group 'slurm' does not exist

2022-01-25 Thread Jeffrey R. Lang
Looking at what you provided in your email, the groupadd commands are failing 
because the requested GIDs 991 and 992 are already assigned on the system you're 
installing on.

Check the /etc/group file and find two GID numbers lower than 991 that are 
unused and use those instead.  Keep them in the 900 range; going too low can 
run into system GID assignments.

Once you have selected your new GIDs, use them in the groupadd commands to 
create the proper groups.  Also use these new GIDs in the useradd commands as 
appropriate.
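
Something along these lines, with 966 and 967 standing in for whatever IDs turn out to be free on your machine (check /etc/passwd for the UIDs as well):

getent group 991 992       # see what currently owns the conflicting GIDs
export MUNGEUSER=966
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=967
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm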



From: slurm-users  On Behalf Of Nousheen
Sent: Tuesday, January 25, 2022 2:39 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Fwd: useradd: group 'slurm' does not exist



Hello everyone,

I am struggling with the installation of slurm on Centos 7. while following 
this tutorial 
https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/
, after the installation of MariaDB, I try to create users for slurm and munge 
but following the same sequence of commands as in the tutorial gives me the 
following error.

Package 1:mariadb-server-5.5.68-1.el7.x86_64 is obsoleted by 
mysql-community-server-5.7.37-1.el7.x86_64 which is already installed
Package 1:mariadb-devel-5.5.68-1.el7.x86_64 is obsoleted by 
mysql-community-devel-5.7.37-1.el7.x86_64 which is already installed
Nothing to do
[root@exxact slurm]# export MUNGEUSER=991
[root@exxact slurm]# groupadd -g $MUNGEUSER munge
groupadd: GID '991' already exists
[root@exxact slurm]# useradd  -m -c "MUNGE Uid 'N' Gid Emporium" -d 
/var/lib/munge -u $MUNGEUSER -g munge  -s /sbin/nologin munge
useradd: group 'munge' does not exist
[root@exxact slurm]# export SLURMUSER=992
[root@exxact slurm]# groupadd -g $SLURMUSER slurm
groupadd: GID '992' already exists
[root@exxact slurm]# useradd  -m -c "SLURM workload manager" -d /var/lib/slurm 
-u $SLURMUSER -g slurm  -s /bin/bash slurm
useradd: group 'slurm' does not exist

I am totally new to this. Kindly guide me on how to resolve this.

Best Regards,
Nousheen Parvaiz
ᐧ
ᐧ


Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-27 Thread Jeffrey R. Lang
The missing-file error has nothing to do with Slurm.  The systemctl command is 
part of the system's service management (systemd).

The error message indicates that you haven't copied the slurmd.service file on 
your compute node into /etc/systemd/system or /usr/lib/systemd/system.  
/etc/systemd/system is usually used when a user adds a new service to a machine.

Depending on your version of Linux you may also need to do a systemctl 
daemon-reload to activate the slurmd.service within systemd.

Once slurmd.service is copied over, the systemctl command should work just 
fine.
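
Concretely, from the top of the unpacked Slurm source tree on the compute node, that is roughly the following (the location of the generated .service file can vary with version and build method):

cp etc/slurmd.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now slurmd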

Remember:
slurmd.service -  Only on compute nodes
slurmctld.service – Only on your cluster management node
  slurmdbd.service – Only on your cluster management node

From: slurm-users  On Behalf Of Nousheen
Sent: Thursday, January 27, 2022 3:54 AM
To: Slurm User Community List 
Subject: [slurm-users] systemctl enable slurmd.service Failed to execute 
operation: No such file or directory



Hello everyone,

I am installing slurm on Centos 7 following tutorial: 
https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/

I am at the step where we start slurm but it gives me the following error:

[root@exxact slurm-21.08.5]# systemctl enable slurmd.service
Failed to execute operation: No such file or directory

I have run the command to check if slurm is configured properly

[root@exxact slurm-21.08.5]# slurmd -C
NodeName=exxact CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 
ThreadsPerCore=2 RealMemory=31889
UpTime=19-16:06:00

I am new to this and unable to understand the problem. Kindly help me resolve 
this.

My slurm.conf file is as follows:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster194
SlurmctldHost=192.168.60.194
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=1
#MaxStepCount=4
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=nousheen
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/home/nousheen/Documents/SILICS/slurm-21.08.5/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=linux[1-32] CPUs=11 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP


Best Regards,
Nousheen Parvaiz
ᐧ


[slurm-users] Where is the documentation for saving batch script

2022-03-17 Thread Jeffrey R. Lang
Hello

  I want to look into the new feature of saving job scripts in the Slurm 
database but have been unable to find documentation on how to do it.   Can 
someone please point me in the right direction for the documentation or slurm 
configuration changes that need to be implemented?
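
If I remember correctly this is controlled by AccountingStoreFlags in slurm.conf (21.08 and later), and the stored script is then retrieved with sacct; please verify the exact option names against the slurm.conf and sacct man pages for your version:

# slurm.conf
AccountingStoreFlags=job_script,job_env

# later, pull a stored script (or environment) back out
sacct -j 1234567 --batch-script
sacct -j 1234567 --env-vars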

Thanks
jeff


[slurm-users] Help with failing job execution

2022-03-24 Thread Jeffrey R. Lang
My site recently updated to Slurm 21.08.6 and for the most part everything went 
fine.  Two Ubuntu nodes, however, are having issues: slurmd cannot execve 
jobs on those nodes.  As an example:

[jrlang@tmgt1 ~]$ salloc -A ARCC --nodes=1 --ntasks=20 -t 1:00:00 --bell 
--nodelist=mdgx01 --partition=dgx /bin/bash
salloc: Granted job allocation 2328489
[jrlang@tmgt1 ~]$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
srun: error: task 1 launch failed: Slurmd could not execve job
srun: error: task 2 launch failed: Slurmd could not execve job
srun: error: task 3 launch failed: Slurmd could not execve job
srun: error: task 4 launch failed: Slurmd could not execve job
srun: error: task 5 launch failed: Slurmd could not execve job
srun: error: task 6 launch failed: Slurmd could not execve job
srun: error: task 7 launch failed: Slurmd could not execve job
srun: error: task 8 launch failed: Slurmd could not execve job
srun: error: task 9 launch failed: Slurmd could not execve job
srun: error: task 10 launch failed: Slurmd could not execve job
srun: error: task 11 launch failed: Slurmd could not execve job
srun: error: task 12 launch failed: Slurmd could not execve job
srun: error: task 13 launch failed: Slurmd could not execve job
srun: error: task 14 launch failed: Slurmd could not execve job
srun: error: task 15 launch failed: Slurmd could not execve job
srun: error: task 16 launch failed: Slurmd could not execve job
srun: error: task 17 launch failed: Slurmd could not execve job
srun: error: task 18 launch failed: Slurmd could not execve job
srun: error: task 19 launch failed: Slurmd could not execve job

Looking in slurmd-mdgx01.log we only see

[2022-03-24T14:44:02.408] [2328501.interactive] error: Failed to invoke task 
plugins: one of task_p_pre_setuid functions returned error
[2022-03-24T14:44:02.409] [2328501.interactive] error: job_manager: exiting 
abnormally: Slurmd could not execve job
[2022-03-24T14:44:02.411] [2328501.interactive] done with job


Note that this issue didn't occur with Slurm 20.11.8.

Any ideas what could be causing the issue, because I'm stumped?

Jeff
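Not a definitive diagnosis, but since task_p_pre_setuid comes from the TaskPlugin stack, a 
first pass on the affected node might look like the sketch below (the cgroup.conf path is 
an assumption; some Ubuntu builds use /etc/slurm-llnl instead):

# which task/proctrack plugins is the node actually running?
scontrol show config | grep -E 'TaskPlugin|ProctrackType'
# if task/cgroup is in use, check the cgroup config and that the controllers are mounted
grep -v '^#' /etc/slurm/cgroup.conf
mount | grep cgroup
# then raise SlurmdDebug (e.g. SlurmdDebug=debug2) on that node and rerun a job to see which task plugin call fails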


[slurm-users] How to open a slurm support case

2022-03-24 Thread Jeffrey R. Lang
Can someone provide me with instructions on how to open a support case with 
SchedMD?

We have a support contract, but nowhere on their website can I find a link to 
open a case with them.

Thanks,
Jeff


[slurm-users] Preempt jobs to stay within account TRES limits?

2022-10-21 Thread Matthew R. Baney
Hello,

I have noticed that jobs submitted to non-preemptable partitions
(PreemptType = preempt/partition_prio and PreemptMode = REQUEUE) under
accounts with GrpTRES limits will become pending with AssocGrpGRES as the
reason when the account is up against the relevant limit, even when there
are other running jobs on preemptable partitions under the same account and
when the pending jobs have higher priority. The GRES in consideration are
GPUs.

It seems like the scheduler is checking to see if the pending jobs are
within the GRES limit for the account before considering if any of the
other jobs in the account are running on preemptable partitions. In some
specific observed cases, even preempting a single job running in a
preemptable partition would allow the non-preemptable partition job to run
(based on GRES freed up from preemption).

Is it possible to reverse the order in which these checks are evaluated?

Best,
Matthew

-- 
Matthew Baney
UMIACS Technical Staff
mba...@umd.edu | (301) 405-6756
University of Maryland Institute for Advanced Computer Studies
3154 Brendan Iribe Center
8125 Paint Branch Dr.
College Park, MD 20742


Re: [slurm-users] Per-user TRES summary?

2022-11-28 Thread Jeffrey R. Lang
You might try the slurmuserjobs command as part of the Slurm_tools package 
found here https://github.com/OleHolmNielsen/Slurm_tools



From: slurm-users  On Behalf Of Djamil 
Lakhdar-Hamina
Sent: Monday, November 28, 2022 5:49 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Per-user TRES summary?




On Mon, Nov 28, 2022 at 10:17 AM Pacey, Mike 
mailto:m.pa...@lancaster.ac.uk>> wrote:
Hi folks,

Does anyone have suggestions as to how to produce a summary of a user’s TRES 
resources for running jobs? I’d like to be able to see how each user is faring 
against their QOS resource limits. (I’m looking for something functionally 
equivalent to Grid Engine’s qquota command). The info must be in the scheduler 
somewhere in order for it to enforce qos TRES limits, but as a SLURM novice 
I’ve not found any way to do this.

To summarise TRES qos limits I can do this:

% sacctmgr list qos format=Name,MaxTRESPerUser%50
  Name  MaxTRESPU
-- --
normalcpu=80,mem=320G

But to work out what a user is currently using in currently running jobs, the 
nearest I can work out is:

% sacct -X -s R --units=G -o User,ReqTRES%50
 UserReqTRES
- --
pacey   billing=1,cpu=1,mem=0.49G,node=1
pacey   billing=1,cpu=1,mem=0.49G,node=1

With a little scripting I can sum those up, but there might be a neater way to 
do this?

Regards,
Mike
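For the archive: one rough way to do that summing with standard tools (a sketch only; the 
parsing assumes the ReqTRES format shown above and --units=G), plus scontrol's assoc_mgr 
view, which appears to report live usage against QOS limits:

sacct -X -s R -n --units=G -o User%-15,ReqTRES%60 | \
  awk '{n=split($2,a,","); cpu=0; mem=0
        for(i=1;i<=n;i++){split(a[i],kv,"="); if(kv[1]=="cpu") cpu=kv[2]; if(kv[1]=="mem") mem=kv[2]+0}
        c[$1]+=cpu; m[$1]+=mem}
       END{for(u in c) printf "%-15s cpu=%d mem=%.2fG\n", u, c[u], m[u]}'

scontrol show assoc_mgr flags=qos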



Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jeffrey R. Lang
The service is available in RHEL 8 via the EPEL package repository as 
systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8 (epel).


-Original Message-
From: slurm-users  On Behalf Of Ole Holm 
Nielsen
Sent: Monday, October 30, 2023 1:56 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to delay the start of slurmd until 
Infiniband/OPA network is fully up?



Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:
> Actually there is no need for such a script since
> /lib/systemd/systemd-networkd-wait-online should be able to handle it.

It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones, AFAICT.

/Ole
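For reference, the usual systemd way to express this ordering is a drop-in for the slurmd 
unit; a sketch only (the drop-in file name is arbitrary, and whether network-online.target 
actually waits for the IB/OPA interface depends on which wait-online service is enabled and 
how it is configured):

# /etc/systemd/system/slurmd.service.d/wait-for-fabric.conf
[Unit]
Wants=network-online.target
After=network-online.target

# then
systemctl daemon-reload
systemctl enable systemd-networkd-wait-online.service   # or NetworkManager-wait-online.service
systemctl restart slurmd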




[slurm-users] Cleanup of old clusters in database

2024-01-10 Thread Jeffrey R. Lang
We have shuttered two clusters and need to remove them from the database.  To 
do this, do we  remove the table spaces associated with the cluster names from 
the Slurm database?

Thanks,
Jeff
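(For the archive: the supported route appears to be sacctmgr rather than dropping tables by 
hand; a sketch, with "oldcluster" standing in for the shuttered cluster's name. Whether the 
per-cluster tables are purged immediately or handled by slurmdbd's archive/purge settings is 
worth confirming before counting on the disk space.)

sacctmgr show cluster
sacctmgr delete cluster oldcluster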




Re: [slurm-users] PMIx and Slurm

2017-11-28 Thread r...@open-mpi.org
Very true - one of the risks with installing from packages. However, be aware 
that slurm 17.02 doesn’t support PMIx v2.0, and so this combination isn’t going 
to work anyway.

If you want PMIx v2.x, then you need to pair it with SLURM 17.11.

Ralph

> On Nov 28, 2017, at 2:32 PM, Philip Kovacs  wrote:
> 
> This issue is that pmi 2.0+ provides a "backward compatibility" feature, 
> enabled by default, which installs
> both libpmi.so and libpmi2.so in addition to libpmix.so.  The route with the 
> least friction for you would probably
> be to uninstall pmix, then install slurm normally, letting it install its 
> libpmi and libpmi2.  Next configure and compile
> a custom pmix with that backward feature _disabled_, so it only installs 
> libpmix.so.   Slurm will "see" the pmix library
> after you install it and load it via its plugin when you use --mpi=pmix.   
> Again, just use the Slurm pmi and pmi2 and 
> install pmix separately with the backward compatible option disabled.
> 
> There is a packaging issue there in which two packages are trying to install 
> their own versions of the same files.  
> That should be brought to attention of the packages.  Meantime you can work 
> around it.
> 
> For PMIX:
> 
> ./configure --disable-pmi-backward-compatibility // ... etc ...
> 
> 
> 
> On Tuesday, November 28, 2017 4:44 PM, Artem Polyakov  
> wrote:
> 
> 
> Hello, Paul
> 
> Please see below.
> 
> 2017-11-28 13:13 GMT-08:00 Paul Edmon  >:
> So in an effort to future proof ourselves we are trying to build Slurm 
> against PMIx, but when I tried to do so I got the following:
> 
> Transaction check error:
>   file /usr/lib64/libpmi.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64
>   file /usr/lib64/libpmi2.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64
> 
> This is with compiling Slurm with the --with-pmix=/usr option.  A few things:
> 
> 1. I'm surprised when I tell it to use PMIx it still builds its own versions 
> of libpmi and pmi2 given that PMIx handles that now.
> 
> PMIx is a plugin and from multiple perspectives it makes sense to keep the 
> other versions available (i.e. backward compat or perf comparison) 
>  
> 
> 2. Does this mean I have to install PMIx in a nondefault location?  If so how 
> does that work with user build codes?  I'd rather not have multiple versions 
> of PMI around for people to build against.
> When we introduced PMIx it was in the beta stage and we didn't want to build 
> against it by default. Now it probably makes sense to assume --with-pmix by 
> default.
> I'm also thinking that we might need to solve it at the packagers level by 
> distributing a "slurm-pmix" package that is built against and depends on the pmix 
> package that is currently shipped with particular Linux distro.
>  
> 
> 3.  What is the right way of building PMIx and Slurm such that they 
> interoperate properly?
> As for now it is better to have a PMIx installed in the well-known location. 
> And then build your MPIs or other apps against this PMIx installation.
> Starting (I think) from PMIx v2.1 we will have a cross-version support that 
> will give some flexibility about what installation to use with application,
>  
> 
> Suffice it to say little to no documentation exists on how to do this properly, 
> so any guidance would be much appreciated.
> Indeed we have some problems with the documentation as PMIx technology is 
> relatively new. Hopefully we can fix this in near future.
> Being the original developer of the PMIx plugin I'll be happy to answer any 
> questions and help to resolve the issues.
> 
> 
>  
> 
> 
> -Paul Edmon-
> 
> 
> 
> 
> 
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> 
> 



Re: [slurm-users] PMIx and Slurm

2017-11-28 Thread r...@open-mpi.org
My apologies - I guess we hadn’t been tracking it that way. I’ll try to add 
some clarification. We presented a nice table at the BoF and I just need to 
find a few minutes to post it.

I believe you do have to build slurm against PMIx so that the pmix plugin is 
compiled. You then also have to specify --mpi=pmix so slurm knows to use that 
plugin for this specific job.

You actually might be able to use the PMIx backward compatibility, and you 
might want to do so with slurm 17.11 and above as Mellanox did a nice job of 
further optimizing launch performance on IB platforms by adding fabric-based 
collective implementations to the pmix plugin. If you replace the slurm libpmi 
and libpmi2 with the ones from PMIx, what will happen is that PMI and PMI2 
calls will be converted to their PMIx equivalent and passed to the pmix plugin. 
This lets you take advantage of what Mellanox did.

The caveat is that your MPI might ask for some PMI/PMI2 feature that we didn’t 
implement. We have tested with MPICH as well as OMPI and it was fine - but we 
cannot give you a blanket guarantee (e.g., I’m pretty sure MVAPICH won’t work). 
Probably safer to stick with the slurm libs for that reason unless you test to 
ensure it all works.


> On Nov 28, 2017, at 6:42 PM, Paul Edmon  wrote:
> 
> Okay, I didn't see any note on the PMIx 2.1 page about versions of slurm it 
> was compatible with so I assumed all of them.  My bad.  Thanks for the 
> correction and the help.  I just naively used the rpm spec that was packaged 
> with PMIx which does enable the legacy support.  It seems best then to let 
> PMIx handle pmix solely and let slurm handle the rest.  Thanks!
> 
> Am I right in reading that you don't have to build slurm against PMIx?  So it 
> just interoperates with it fine if you just have it installed and specify 
> pmix as the launch option?  That's neat.
> -Paul Edmon-
> 
> On 11/28/2017 6:11 PM, Philip Kovacs wrote:
>> Actually if you're set on installing pmix/pmix-devel from the rpms and then 
>> configuring slurm manually,
>> you could just move the pmix-installed versions of libpmi.so* and 
>> libpmi2.so* to a safe place, configure
>> and install slurm which will drop in its versions of those libs and then 
>> either use the slurm versions or move
>> the pmix versions of libpmi and libpmi2 back into place in /usr/lib64. 
>> 
>> 
>> On Tuesday, November 28, 2017 5:32 PM, Philip Kovacs  
>>  wrote:
>> 
>> 
>> This issue is that pmi 2.0+ provides a "backward compatibility" feature, 
>> enabled by default, which installs
>> both libpmi.so and libpmi2.so in addition to libpmix.so.  The route with the 
>> least friction for you would probably
>> be to uninstall pmix, then install slurm normally, letting it install its 
>> libpmi and libpmi2.  Next configure and compile
>> a custom pmix with that backward feature _disabled_, so it only installs 
>> libpmix.so.   Slurm will "see" the pmix library
>> after you install it and load it via its plugin when you use --mpi=pmix.   
>> Again, just use the Slurm pmi and pmi2 and 
>> install pmix separately with the backward compatible option disabled.
>> 
>> There is a packaging issue there in which two packages are trying to install 
>> their own versions of the same files.  
>> That should be brought to attention of the packages.  Meantime you can work 
>> around it.
>> 
>> For PMIX:
>> 
>> ./configure --disable-pmi-backward-compatibility // ... etc ...
>> 
>> 
>> 
>> On Tuesday, November 28, 2017 4:44 PM, Artem Polyakov  
>>  wrote:
>> 
>> 
>> Hello, Paul
>> 
>> Please see below.
>> 
>> 2017-11-28 13:13 GMT-08:00 Paul Edmon > >:
>> So in an effort to future proof ourselves we are trying to build Slurm 
>> against PMIx, but when I tried to do so I got the following:
>> 
>> Transaction check error:
>>   file /usr/lib64/libpmi.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64
>>   file /usr/lib64/libpmi2.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64
>> 
>> This is with compiling Slurm with the --with-pmix=/usr option.  A few things:
>> 
>> 1. I'm surprised when I tell it to use PMIx it still builds its own versions 
>> of libpmi and pmi2 given that PMIx handles that now.
>> 
>> PMIx is a plugin and from multiple perspectives it makes sense to keep the 
>> other versions available (i.e. backward compat or perf comparison) 
>>  
>> 
>> 2. Does this mean I have to install PMIx in a nondefault location?  If so 
>> how does that work with user build codes?  I'd rather not have multiple 
>> versions of PMI around for people to build against.
>> When we introduced PMIx it was in the beta stage and we didn't want to build 
>> against it by default. Now it probably makes sense to assume --with-pmix by 
>> default.
>> I'm also thinking that we might

Re: [slurm-users] PMIx and Slurm

2017-11-28 Thread r...@open-mpi.org
Thanks for your patience and persistence. I’ll find a place to post your 
experiences to help others as they navigate these waters.


> On Nov 28, 2017, at 8:52 PM, Philip Kovacs  wrote:
> 
> I doubled checked and yes, you definitely want the pmix headers and libpmix 
> library installed before you configure slurm.
> No need to use --with-pmix if pmix is installed in standard system locations. 
> Configure slurm and it will see the pmix 
> installation.  After configuring slurm, but before installing it, manually 
> remove the pmix versions of libpmi.so* and libpmi2.so*. 
> Install slurm and use its versions of those libs.  Test every mpi variant 
> seen when you run `srun --mpi=list hostname`.  
> You should see pmi2 and pmix in that list and several others.   The pmix 
> option will invoke a slurm plugin that is linked 
> directly to the libpmix.so library.  If you favor using the pmix versions of 
> pmi/pmi2, sounds like you'll get better performance
> when using pmi/pmi2, but as mentioned, you would want to test every mpi 
> variant listed to make sure everything works.
> 
> 
> On Tuesday, November 28, 2017 9:57 PM, "r...@open-mpi.org" 
>  wrote:
> 
> 
> My apologies - I guess we hadn’t been tracking it that way. I’ll try to add 
> some clarification. We presented a nice table at the BoF and I just need to 
> find a few minutes to post it.
> 
> I believe you do have to build slurm against PMIx so that the pmix plugin is 
> compiled. You then also have to specify --mpi=pmix so slurm knows to use that 
> plugin for this specific job.
> 
> You actually might be able to use the PMIx backward compatibility, and you 
> might want to do so with slurm 17.11 and above as Mellanox did a nice job of 
> further optimizing launch performance on IB platforms by adding fabric-based 
> collective implementations to the pmix plugin. If you replace the slurm 
> libpmi and libpmi2 with the ones from PMIx, what will happen is that PMI and 
> PMI2 calls will be converted to their PMIx equivalent and passed to the pmix 
> plugin. This lets you take advantage of what Mellanox did.
> 
> The caveat is that your MPI might ask for some PMI/PMI2 feature that we 
> didn’t implement. We have tested with MPICH as well as OMPI and it was fine - 
> but we cannot give you a blanket guarantee (e.g., I’m pretty sure MVAPICH 
> won’t work). Probably safer to stick with the slurm libs for that reason 
> unless you test to ensure it all works.
> 
> 
>> On Nov 28, 2017, at 6:42 PM, Paul Edmon > <mailto:ped...@cfa.harvard.edu>> wrote:
>> 
> 
> Okay, I didn't see any note on the PMIx 2.1 page about versions of slurm it 
> was compatible with so I assumed all of them.  My bad.  Thanks for the 
> correction and the help.  I just naively used the rpm spec that was packaged 
> with PMIx which does enable the legacy support.  It seems best then to let 
> PMIx handle pmix solely and let slurm handle the rest.  Thanks!
> Am I right in reading that you don't have to build slurm against PMIx?  So it 
> just interoperates with it fine if you just have it installed and specify 
> pmix as the launch option?  That's neat.
> -Paul Edmon-
> 
> On 11/28/2017 6:11 PM, Philip Kovacs wrote:
>> Actually if you're set on installing pmix/pmix-devel from the rpms and then 
>> configuring slurm manually,
>> you could just move the pmix-installed versions of libpmi.so* and 
>> libpmi2.so* to a safe place, configure
>> and install slurm which will drop in its versions of those libs and then 
>> either use the slurm versions or move
>> the pmix versions of libpmi and libpmi2 back into place in /usr/lib64. 
>> 
>> 
>> On Tuesday, November 28, 2017 5:32 PM, Philip Kovacs  
>> <mailto:pkde...@yahoo.com> wrote:
>> 
>> 
>> This issue is that pmi 2.0+ provides a "backward compatibility" feature, 
>> enabled by default, which installs
>> both libpmi.so and libpmi2.so in addition to libpmix.so.  The route with the 
>> least friction for you would probably
>> be to uninstall pmix, then install slurm normally, letting it install its 
>> libpmi and libpmi2.  Next configure and compile
>> a custom pmix with that backward feature _disabled_, so it only installs 
>> libpmix.so.   Slurm will "see" the pmix library
>> after you install it and load it via its plugin when you use --mpi=pmix.   
>> Again, just use the Slurm pmi and pmi2 and 
>> install pmix separately with the backward compatible option disabled.
>> 
>> There is a packaging issue there in which two packages are trying to install 
>> their own versions of the same files.  
>> That should be brough

Re: [slurm-users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-18 Thread r...@open-mpi.org
Repeated here from the OMPI list:

We have had reports of applications running faster when executing under OMPI’s 
mpiexec versus when started by srun. Reasons aren’t entirely clear, but are 
likely related to differences in mapping/binding options (OMPI provides a very 
large range compared to srun) and optimization flags provided by mpiexec that 
are specific to OMPI.

OMPI uses PMIx for wireup support (starting with the v2.x series), which 
provides a faster startup than other PMI implementations. However, that is also 
available with Slurm starting with the 16.05 release, and some further 
PMIx-based launch optimizations were recently added to the Slurm 17.11 release. 
So I would expect that launch via srun with the latest Slurm release and PMIx 
would be faster than mpiexec - though that still leaves the faster execution 
reports to consider.

HTH
Ralph
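As a concrete illustration of the two launch paths being compared (./my_mpi_app and the 
binding flags are placeholders, not a recommendation):

# launched by Slurm's own launcher, using the PMIx plugin
srun --mpi=list                    # confirm pmix is available
srun --mpi=pmix -n 64 ./my_mpi_app

# launched by Open MPI's mpiexec inside the allocation
mpiexec -n 64 --map-by core --bind-to core ./my_mpi_app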


> On Dec 18, 2017, at 2:26 PM, Prentice Bisbal  wrote:
> 
> Slurm users,
> 
> I've already posted this question to the OpenMPI and Beowulf lists, but I 
> also wanted to post this question here to get more Slurm-specific opinions, 
> in case some of you don't subscribe to those lists and have meaningful input to 
> provide. For those of you that subscribe to one or more of these lists, I 
> apologize for making you read this a 3rd time.
> 
> We use OpenMPI with Slurm as our scheduler, and a user has asked me this: 
> should they use mpiexec/mpirun or srun to start their MPI jobs through Slurm?
> 
> My inclination is to use mpiexec, since that is the only method that's 
> (somewhat) defined in the MPI standard and therefore the most portable, and 
> the examples in the OpenMPI FAQ use mpirun. However, the Slurm documentation 
> on the schedmd website say to use srun with the --mpi=pmi option. (See links 
> below)
> 
> What are the pros/cons of using these two methods, other than the portability 
> issue I already mentioned? Does srun+pmi use a different method to wire up 
> the connections? Some things I read online seem to indicate that. If slurm 
> was built with PMI support, and OpenMPI was built with Slurm support, does it 
> really make any difference?
> 
> https://www.open-mpi.org/faq/?category=slurm
> https://slurm.schedmd.com/mpi_guide.html#open_mpi
> 
> -- 
> Prentice Bisbal
> Lead Software Engineer
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov
> 
> 




Re: [slurm-users] OpenMPI & Slurm: mpiexec/mpirun vs. srun

2017-12-18 Thread r...@open-mpi.org
If it truly is due to mapping/binding and optimization params, then I would 
expect it to be highly application-specific. The sporadic nature of the reports 
would seem to also support that possibility.

I’d be very surprised to find run time scaling better with srun unless you are 
using some layout option with one launcher that you aren’t using with the other. mpiexec 
has all the srun layout options, and a lot more - so I suspect you just aren’t 
using the equivalent mpiexec option. Exploring those might even reveal a 
combination that runs better :-)

Launch time, however, is a different subject.


> On Dec 18, 2017, at 5:23 PM, Christopher Samuel  wrote:
> 
> On 19/12/17 12:13, r...@open-mpi.org wrote:
> 
>> We have had reports of applications running faster when executing under 
>> OMPI’s mpiexec versus when started by srun.
> 
> Interesting, I know that used to be the case with older versions of
> Slurm but since (I think) about 15.x we saw srun scale better than
> mpirun (this was for the molecular dynamics code NAMD).
> 
> -- 
> Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
> 




Re: [slurm-users] [17.11.1] no good pmi intention goes unpunished

2017-12-20 Thread r...@open-mpi.org
On Dec 20, 2017, at 6:21 PM, Philip Kovacs  wrote:
> 
> >  -- slurm.spec: move libpmi to a separate package to solve a conflict with 
> > the
> >version provided by PMIx. This will require a separate change to PMIx as
> >well.
> 
> I see the intention behind this change since the pmix 2.0+ package provides 
> libpmi/libpmi2
> and there is a possible (installation) conflict with the Slurm implementation 
> of those libraries.  
> We've discussed  that issue earlier.
> 
> Now, suppose a user installs the pmix versions of libpmi/pmi2 with the 
> expectation that pmi
> calls will be forwarded to libpmix for greater speed, the so-called "backward 
> compatibility" feature.
> 
> Shouldn't the Slurm mpi_pmi2 plugin attempt to link with libpmi2 instead of 
> its internal 
> implementation of pmi2?  As it stands now, there won't be any forwarding of 
> pmi2 code 
> to libpmix which I imagine users would expect in that scenario.

Sadly, it isn’t quite that simple. Most of the standard PMI2 calls are covered 
by the backward compatibility libraries, so things like MPICH should work 
out-of-the-box.

However, MVAPICH2 added a PMI2 extension call to the SLURM PMI2 library that 
they use and PMIx doesn’t cover (as there really isn’t an easy equivalent, and 
they called it PMIX_foo which causes a naming conflict), and so they would not 
work.




Re: [slurm-users] [17.11.1] no good pmi intention goes unpunished

2017-12-21 Thread r...@open-mpi.org
Hmmm - I think there may be something a little more subtle here. If you build 
your app and link it against “libpmi2”, and that library is actually the one 
from PMIx, then it won’t work with Slurm’s PMI2 plugin because the 
communication protocols are completely different.

So the fact is that if you want to use PMIx backward compatibility, you (a) 
need to link against either libpmix or the libpmi libraries we export (they are 
nothing more than symlinks to libpmix), and (b) specify --mpi=pmix on the srun 
cmd line.



> On Dec 21, 2017, at 11:44 AM, Philip Kovacs  wrote:
> 
> OK, so slurm's libpmi2 is a functional superset of the libpmi2 provided by 
> pmix 2.0+.  That's good to know.
> 
> My point here is that, if you use slurm's mpi/pmi2 plugin, regardless of 
> which libpmi2 is installed, 
> slurm or pmix, you will always run the slurm pmi2 code since it is compiled 
> directly into the plugin.
> 
> 
> On Wednesday, December 20, 2017 10:47 PM, "r...@open-mpi.org" 
>  wrote:
> 
> 
> On Dec 20, 2017, at 6:21 PM, Philip Kovacs  <mailto:pkde...@yahoo.com>> wrote:
>> 
>> >  -- slurm.spec: move libpmi to a separate package to solve a conflict with 
>> > the
>> >version provided by PMIx. This will require a separate change to PMIx as
>> >well.
>> 
>> I see the intention behind this change since the pmix 2.0+ package provides 
>> libpmi/libpmi2
>> and there is a possible (installation) conflict with the Slurm 
>> implementation of those libraries.  
>> We've discussed  that issue earlier.
>> 
>> Now, suppose a user installs the pmix versions of libpmi/pmi2 with the 
>> expectation that pmi
>> calls will be forwarded to libpmix for greater speed, the so-called 
>> "backward compatibility" feature.
>> 
>> Shouldn't the Slurm mpi_pmi2 plugin attempt to link with libpmi2 instead of 
>> its internal 
>> implementation of pmi2?  As it stands now, there won't be any forwarding of 
>> pmi2 code 
>> to libpmix which I imagine users would expect in that scenario.
> 
> Sadly, it isn’t quite that simple. Most of the standard PMI2 calls are 
> covered by the backward compatibility libraries, so things like MPICH should 
> work out-of-the-box.
> 
> However, MVAPICH2 added a PMI2 extension call to the SLURM PMI2 library that 
> they use and PMIx doesn’t cover (as there really isn’t an easy equivalent, 
> and they called it PMIX_foo which causes a naming conflict), and so they 
> would not work.
> 
> 
> 
> 



Re: [slurm-users] [17.11.1] no good pmi intention goes unpunished

2017-12-21 Thread r...@open-mpi.org
I need to correct myself - the libs are not symlinks to libpmix. They are 
actual copies of the libpmix library with their own version triplets which 
change only if/when the PMI-1 or PMI-2 abstraction code changes. If they were 
symlinks, we wouldn’t be able to track independent version triplets.

Just to further clarify: the reason we provide libpmi and libpmi2 is that users 
were requesting access to the backward compatibility feature, but their 
apps/libs were hardcoded to dlopen “libpmi” or “libpmi2”. We suggested they 
just manually create the links, but clearly there was some convenience 
associated with directly installing them. Hence, we added a configure option 
"--enable-pmi-backward-compatibility” to control the behavior and set it to 
enabled by default. Disabling it simply causes the other libs to not be made.


> On Dec 21, 2017, at 12:58 PM, Philip Kovacs  wrote:
> 
> >(they are nothing more than symlinks to libpmix)
> 
> This is very helpful to know.
> 
> 
> On Thursday, December 21, 2017 3:28 PM, "r...@open-mpi.org" 
>  wrote:
> 
> 
> Hmmm - I think there may be something a little more subtle here. If you build 
> your app and link it against “libpmi2”, and that library is actually the one 
> from PMIx, then it won’t work with Slurm’s PMI2 plugin because the 
> communication protocols are completely different.
> 
> So the fact is that if you want to use PMIx backward compatibility, you (a) 
> need to link against either libpmix or the libpmi libraries we export (they 
> are nothing more than symlinks to libpmix), and (b) specify --mpi=pmix on the 
> srun cmd line.
> 
> 
> 
>> On Dec 21, 2017, at 11:44 AM, Philip Kovacs > <mailto:pkde...@yahoo.com>> wrote:
>> 
>> OK, so slurm's libpmi2 is a functional superset of the libpmi2 provided by 
>> pmix 2.0+.  That's good to know.
>> 
>> My point here is that, if you use slurm's mpi/pmi2 plugin, regardless of 
>> which libpmi2 is installed, 
>> slurm or pmix, you will always run the slurm pmi2 code since it is compiled 
>> directly into the plugin.
>> 
>> 
>> On Wednesday, December 20, 2017 10:47 PM, "r...@open-mpi.org 
>> <mailto:r...@open-mpi.org>" mailto:r...@open-mpi.org>> 
>> wrote:
>> 
>> 
>> On Dec 20, 2017, at 6:21 PM, Philip Kovacs > <mailto:pkde...@yahoo.com>> wrote:
>>> 
>>> >  -- slurm.spec: move libpmi to a separate package to solve a conflict 
>>> > with the
>>> >version provided by PMIx. This will require a separate change to PMIx 
>>> > as
>>> >well.
>>> 
>>> I see the intention behind this change since the pmix 2.0+ package provides 
>>> libpmi/libpmi2
>>> and there is a possible (installation) conflict with the Slurm 
>>> implementation of those libraries.  
>>> We've discussed  that issue earlier.
>>> 
>>> Now, suppose a user installs the pmix versions of libpmi/pmi2 with the 
>>> expectation that pmi
>>> calls will be forwarded to libpmix for greater speed, the so-called 
>>> "backward compatibility" feature.
>>> 
>>> Shouldn't the Slurm mpi_pmi2 plugin attempt to link with libpmi2 instead of 
>>> its internal 
>>> implementation of pmi2?  As it stands now, there won't be any forwarding of 
>>> pmi2 code 
>>> to libpmix which I imagine users would expect in that scenario.
>> 
>> Sadly, it isn’t quite that simple. Most of the standard PMI2 calls are 
>> covered by the backward compatibility libraries, so things like MPICH should 
>> work out-of-the-box.
>> 
>> However, MVAPICH2 added a PMI2 extension call to the SLURM PMI2 library that 
>> they use and PMIx doesn’t cover (as there really isn’t an easy equivalent, 
>> and they called it PMIX_foo which causes a naming conflict), and so they 
>> would not work.
>> 
>> 
>> 
>> 
> 
> 
> 



[slurm-users] Using PMIx with SLURM

2018-01-03 Thread r...@open-mpi.org
Hi folks

There have been some recent questions on both this and the OpenMPI mailing 
lists about PMIx use with SLURM. I have tried to capture the various 
conversations in a “how-to” guide on the PMIx web site:

https://pmix.org/support/how-to/slurm-support/ 


There are also some hints about how to debug PMIx-based apps: 
https://pmix.org/support/faq/debugging-pmix/ 


The web site is still in its infancy, so there is still a lot to be added. 
However, it may perhaps begin to help a bit. As always, suggestions are welcome.

Ralph



[slurm-users] Fabric manager interactions: request for comments

2018-02-05 Thread r...@open-mpi.org
I apologize in advance if you received a copy of this from other mailing lists
--

Hello all

The PMIx community is starting work on the next phase of defining support for 
network interactions, looking specifically at things we might want to obtain 
and/or control via the fabric manager. A very preliminary draft is shown here:

https://pmix.org/home/pmix-standard/fabric-manager-roles-and-expectations/ 


We would welcome any comments/suggestions regarding information you might find 
useful to get regarding the network, or controls you would like to set.

Thanks in advance
Ralph




Re: [slurm-users] Allocate more memory

2018-02-07 Thread r...@open-mpi.org
I’m afraid neither of those versions is going to solve the problem here - there 
is no way to allocate memory across nodes.

Simple reason: there is no way for a process to directly address memory on a 
separate node - you’d have to implement that via MPI or shmem or some other 
library.


> On Feb 7, 2018, at 6:57 AM, Loris Bennett  wrote:
> 
> Loris Bennett  > writes:
> 
>> Hi David,
>> 
>> david martin  writes:
>> 
>>>  
>>> 
>>> Hi,
>>> 
>>> I would like to submit a job that requires 3Go. The problem is that I have 
>>> 70 nodes available each node with 2Gb memory.
>>> 
>>> So the command sbatch --mem=3G will wait for ressources to become available.
>>> 
>>> Can I run sbatch and tell the cluster to use the 3Go out of the 70Go
>>> available or is that a particular setup ? meaning is the memory
>>> restricted to each node ? or should i allocate two nodes so that i
>>> have 2x4Go availble ?
>> 
>> Check
>> 
>>  man sbatch
>> 
>> You'll find that --mem means memory per node.  Thus, if you specify 3GB
>> but all the nodes have 2GB, your job will wait forever (or until you buy
>> more RAM and reconfigure Slurm).
>> 
>> You probably want --mem-per-cpu, which is actually more like memory per
>> task.
> 
> The above should read
> 
>  You probably want --mem-per-cpu, which is actually more like memory per
>  core and thus memory per task if you have tasks per core set to 1.
> 
>> This is obviously only going to work if your job can actually run
>> on more than one node, e.g. is MPI enabled.
>> 
>> Cheers,
>> 
>> Loris
> -- 
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de 
> 
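To make the distinction concrete, a minimal sketch for the 2 GB-per-node case above 
(./my_mpi_program is a placeholder, and as noted this only helps if the code is MPI-enabled 
and can spread across nodes):

#!/bin/bash
#SBATCH --ntasks=2              # two ranks, free to land on two nodes
#SBATCH --mem-per-cpu=1500M     # 2 x 1500M is roughly 3G in total, but no single rank exceeds a 2 GB node
srun ./my_mpi_program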


Re: [slurm-users] Allocate more memory

2018-02-07 Thread r...@open-mpi.org
Afraid not - since you don’t have any nodes that meet the 3G requirement, 
you’ll just hang.

> On Feb 7, 2018, at 7:01 AM, david vilanova  wrote:
> 
> Thanks for the quick response.
> 
> Should the following script do the trick ?? meaning use all required nodes to 
> have at least 3G total memory ? even though my nodes were setup with 2G each 
> ??
> 
> #SBATCH array 1-10%10:1
> 
> #SBATCH mem-per-cpu=3000m
> 
> srun R CMD BATCH myscript.R
> 
> 
> 
> thanks
> 
> 
> 
> 
> On 07/02/2018 15:50, Loris Bennett wrote:
>> Hi David,
>> 
>> david martin  writes:
>> 
>>> 
>>> 
>>> Hi,
>>> 
>>> I would like to submit a job that requires 3Go. The problem is that I have 
>>> 70 nodes available each node with 2Gb memory.
>>> 
>>> So the command sbatch --mem=3G will wait for ressources to become available.
>>> 
>>> Can I run sbatch and tell the cluster to use the 3Go out of the 70Go
>>> available or is that a particular setup ? meaning is the memory
>>> restricted to each node ? or should i allocate two nodes so that i
>>> have 2x4Go availble ?
>> Check
>> 
>>   man sbatch
>> 
>> You'll find that --mem means memory per node.  Thus, if you specify 3GB
>> but all the nodes have 2GB, your job will wait forever (or until you buy
>> more RAM and reconfigure Slurm).
>> 
>> You probably want --mem-per-cpu, which is actually more like memory per
>> task.  This is obviously only going to work if your job can actually run
>> on more than one node, e.g. is MPI enabled.
>> 
>> Cheers,
>> 
>> Loris
>> 
> 
> 




[slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Greetings,

I am setting up our new GPU cluster, and I seem to have a problem
configuring things so that the devices are properly walled off via
cgroups.  Our nodes each have two GPUs; however, if --gres is unset, or
set to --gres=gpu:0, I can access both GPUs from inside a job.
Moreover, if I ask for just 1 GPU then unset the CUDA_VISIBLE_DEVICES
environmental variable, I can access both GPUs.  From my
understanding, this suggests that it is *not* being protected under
cgroups.

I've read the documentation, and I've read through a number of threads
where people have resolved similar issues.  I've tried a lot of
configurations, but to no avail. Below I include some snippets of
relevant (current) parameters; however, I also am attaching most of
our full conf files.

[slurm.conf]
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
AccountingStorageTRES=gres/gpu
GresTypes=gpu

NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16
ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2

[gres.conf]
NodeName=evc[1-10] Name=gpu File=/dev/nvidia0
COREs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NodeName=evc[1-10] Name=gpu File=/dev/nvidia1
COREs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31

[cgroup.conf]
ConstrainDevices=yes

[cgroup_allowed_devices_file.conf]
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*

Thanks,
Paul.


cgroup_allowed_devices_file.conf
Description: Binary data


cgroup.conf
Description: Binary data


gres.conf
Description: Binary data


slurm.conf
Description: Binary data


Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Thanks Kevin!

Indeed, nvidia-smi in an interactive job tells me that I can get access to
the device when I should not be able to.

I thought including the /dev/nvidia* would whitelist those devices ...
which seems to be the opposite of what I want, no?  Or do I misunderstand?

Thanks,
Paul

On Tue, May 1, 2018, 19:00 Kevin Manalo  wrote:

> Paul,
>
> Having recently set this up, this was my test: when you make a single GPU
> request from inside an interactive run (salloc ... --gres=gpu:1 srun --pty
> bash), you should only see the GPU assigned to you via 'nvidia-smi'
>
> When gres is unset you should see
>
> nvidia-smi
> No devices were found
>
> Otherwise, if you ask for 1 of 2, you should only see 1 device.
>
> Also, I recall appending this to the bottom of
>
> [cgroup_allowed_devices_file.conf]
> ..
> Same as yours
> ...
> /dev/nvidia*
>
> There was a SLURM bug issue that made this clear, not so much in the
> website docs.
>
> -Kevin
>
>
> On 5/1/18, 5:28 PM, "slurm-users on behalf of R. Paul Wiegand" <
> slurm-users-boun...@lists.schedmd.com on behalf of rpwieg...@gmail.com>
> wrote:
>
> Greetings,
>
> I am setting up our new GPU cluster, and I seem to have a problem
> configuring things so that the devices are properly walled off via
> cgroups.  Our nodes each have two GPUs; however, if --gres is unset, or
> set to --gres=gpu:0, I can access both GPUs from inside a job.
> Moreover, if I ask for just 1 GPU then unset the CUDA_VISIBLE_DEVICES
> environmental variable, I can access both GPUs.  From my
> understanding, this suggests that it is *not* being protected under
> cgroups.
>
> I've read the documentation, and I've read through a number of threads
> where people have resolved similar issues.  I've tried a lot of
> configurations, but to no avail. Below I include some snippets of
> relevant (current) parameters; however, I also am attaching most of
> our full conf files.
>
> [slurm.conf]
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/cgroup
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> JobAcctGatherType=jobacct_gather/linux
> AccountingStorageTRES=gres/gpu
> GresTypes=gpu
>
> NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16
> ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2
>
> [gres.conf]
> NodeName=evc[1-10] Name=gpu File=/dev/nvidia0
> COREs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
> NodeName=evc[1-10] Name=gpu File=/dev/nvidia1
> COREs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
>
> [cgroup.conf]
> ConstrainDevices=yes
>
> [cgroup_allowed_devices_file.conf]
> /dev/null
> /dev/urandom
> /dev/zero
> /dev/sda*
> /dev/cpu/*/*
> /dev/pts/*
>
> Thanks,
> Paul.
>
>
>


Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Thanks Chris.  I do have the ConstrainDevices turned on.  Are the
differences in your cgroup_allowed_devices_file.conf relevant in this case?

On Tue, May 1, 2018, 19:23 Christopher Samuel  wrote:

> On 02/05/18 09:00, Kevin Manalo wrote:
>
> > Also, I recall appending this to the bottom of
> >
> > [cgroup_allowed_devices_file.conf]
> > ..
> > Same as yours
> > ...
> > /dev/nvidia*
> >
> > There was a SLURM bug issue that made this clear, not so much in the
> website docs.
>
> That shouldn't be necessary, all we have for this is..
>
> The relevant line from our cgroup.conf:
>
> [...]
> # Constrain devices via cgroups (to limits access to GPUs etc)
> ConstrainDevices=yes
> [...]
>
> Our entire cgroup_allowed_devices_file.conf:
>
> /dev/null
> /dev/urandom
> /dev/zero
> /dev/sda*
> /dev/cpu/*/*
> /dev/pts/*
> /dev/ram
> /dev/random
> /dev/hfi*
>
>
> This is on RHEL7.
>
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>
>


Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Slurm 17.11.0 on CentOS 7.1

On Tue, May 1, 2018, 19:26 Christopher Samuel  wrote:

> On 02/05/18 09:23, R. Paul Wiegand wrote:
>
> > I thought including the /dev/nvidia* would whitelist those devices
> > ... which seems to be the opposite of what I want, no?  Or do I
> > misunderstand?
>
> No, I think you're right there, we don't have them listed and cgroups
> constrains it correctly (nvidia-smi says no devices when you don't
> request any GPUs).
>
> Which version of Slurm are you on?
>
> cheers,
> Chris
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>
>


Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Yes, I am sure they are all the same.  Typically, I just scontrol reconfig;
however, I have also tried restarting all daemons.

We are moving to 7.4 in a few weeks during our downtime.  We had a QDR ->
OFED version constraint -> Lustre client version constraint issue that
delayed our upgrade.

Should I just wait and test after the upgrade?

On Tue, May 1, 2018, 19:56 Christopher Samuel  wrote:

> On 02/05/18 09:31, R. Paul Wiegand wrote:
>
> > Slurm 17.11.0 on CentOS 7.1
>
> That's quite old (on both fronts, RHEL 7.1 is from 2015), we started on
> that same Slurm release but didn't do the GPU cgroup stuff until a later
> version (17.11.3 on RHEL 7.4).
>
> I don't see anything in the NEWS file about relevant cgroup changes
> though (there is a cgroup affinity fix but that's unrelated).
>
> You do have identical slurm.conf, cgroup.conf,
> cgroup_allowed_devices_file.conf etc on all the compute nodes too?
> Slurmd and slurmctld have both been restarted since they were
> configured?
>
> All the best,
> Chris
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>
>


Re: [slurm-users] GPU / cgroup challenges

2018-05-02 Thread R. Paul Wiegand
I dug into the logs on both the slurmctld side and the slurmd side.
For the record, I have debug2 set for both and
DebugFlags=CPU_BIND,Gres.

I cannot see much that is terribly relevant in the logs.  There's a
known parameter error reported with the memory cgroup specifications,
but I don't think that is germane.

When I set "--gres=gpu:1", the slurmd log does have encouraging lines such as:

[2018-05-02T08:47:04.916] [203.0] debug:  Allowing access to device
/dev/nvidia0 for job
[2018-05-02T08:47:04.916] [203.0] debug:  Not allowing access to
device /dev/nvidia1 for job

However, I can still "see" both devices from nvidia-smi, and I can
still access both if I manually unset CUDA_VISIBLE_DEVICES.

When I do *not* specify --gres at all, there is no reference to gres,
gpu, nvidia, or anything similar in any log at all.  And, of course, I
have full access to both GPUs.

I am happy to attach the snippets of the relevant logs, if someone
more knowledgeable wants to pour through them.  I can also set the
debug level higher, if you think that would help.


Assuming upgrading will solve our problem, in the meantime:  Is there
a way to ensure that the *default* request always has "--gres=gpu:1"?
That is, this situation is doubly bad for us not just because there is
*a way* around the resource management of the device but also because
the *DEFAULT* behavior if a user issues an srun/sbatch without
specifying a Gres is to go around the resource manager.



On Tue, May 1, 2018 at 8:29 PM, Christopher Samuel  wrote:
> On 02/05/18 10:15, R. Paul Wiegand wrote:
>
>> Yes, I am sure they are all the same.  Typically, I just scontrol
>> reconfig; however, I have also tried restarting all daemons.
>
>
> Understood. Any diagnostics in the slurmd logs when trying to start
> a GPU job on the node?
>
>> We are moving to 7.4 in a few weeks during our downtime.  We had a
>> QDR -> OFED version constraint -> Lustre client version constraint
>> issue that delayed our upgrade.
>
>
> I feel your pain..  BTW RHEL 7.5 is out now so you'll need that if
> you need current security fixes.
>
>> Should I just wait and test after the upgrade?
>
>
> Well 17.11.6 will be out then that will include for a deadlock
> that some sites hit occasionally, so that will be worth throwing
> into the mix too.   Do read the RELEASE_NOTES carefully though,
> especially if you're using slurmdbd!
>
>
> All the best,
> Chris
> --
>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>



Re: [slurm-users] GPU / cgroup challenges

2018-05-21 Thread R. Paul Wiegand
I am following up on this to first thank everyone for their suggestions and also 
let you know that indeed, upgrading from 17.11.0 to 17.11.6 solved the problem.  
Our GPUs are now properly walled off via cgroups per our existing config.

Thanks!

Paul.
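In case it helps the next person checking the same thing, a quick verification sketch (the 
cgroup path assumes the default cgroup v1 devices hierarchy Slurm creates and may differ 
per site):

srun --gres=gpu:1 --pty bash
nvidia-smi -L                                   # should list exactly one GPU
unset CUDA_VISIBLE_DEVICES; nvidia-smi -L       # still one GPU if the cgroup is doing the work
cat /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/step_*/devices.list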


> On May 5, 2018, at 9:04 AM, Chris Samuel  wrote:
> 
> On Wednesday, 2 May 2018 11:04:34 PM AEST R. Paul Wiegand wrote:
> 
>> When I set "--gres=gpu:1", the slurmd log does have encouraging lines such
>> as:
>> 
>> [2018-05-02T08:47:04.916] [203.0] debug:  Allowing access to device
>> /dev/nvidia0 for job
>> [2018-05-02T08:47:04.916] [203.0] debug:  Not allowing access to
>> device /dev/nvidia1 for job
>> 
>> However, I can still "see" both devices from nvidia-smi, and I can
>> still access both if I manually unset CUDA_VISIBLE_DEVICES.
> 
> The only thing I can think of is a bug that's been fixed since 17.11.0 (as I 
> know it works for us with 17.11.5) or a kernel bug (or missing device 
> cgroups).
> 
> Sorry I can't be more helpful!
> 
> All the best,
> Chris
> -- 
> Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
> 
> 




[slurm-users] Noob slurm question

2018-12-12 Thread Merritt, Todd R - (tmerritt)
Hi all,
I'm new to slurm. I've used PBS extensively and have set up an 
accounting system that gives groups/account a fixed number of hours per month 
on a per queue/partition basis. It decrements that time allocation with every 
job run and then resets it to the original value at the start of the next 
month. I had hoped that slurm would do this natively but it doesn't seem like 
it does. I came across sbank which sounds like it would implement this but it 
also seems like it would span partitions and not allow separate limits per 
partition. Is this something that has already been implemented or could be done 
in an easier way than what I'm trying?

Thanks,
Todd


Re: [slurm-users] Noob slurm question

2018-12-12 Thread Merritt, Todd R - (tmerritt)
Thanks Thomas,
That's helpful and a bit more tenable than what I thought was 
going to be required. I have a few additional questions. Based on my reading of 
the docs, it seems that GrpTRESmin is set on the account and then each user 
needs to have the partition set there. This brings up a couple of questions for 
me:


  *   How can an account have multiple GrpTRESmin values for separate 
partitions? I'm guessing those have to be separate accounts then?
  *   All of limits that I applied per queue in pbs are all in qos settings in 
slurm so I could dispense with the additional partitions but I also need to 
limit some classes of jobs to particular sets of nodes and I didn't see any way 
to accomplish that besides partitions.


Thanks again!
Todd

From: slurm-users  on behalf of "Thomas 
M. Payerle" 
Reply-To: Slurm User Community List 
Date: Wednesday, December 12, 2018 at 1:45 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Noob slurm question

Slurm accounting is based on the notion of "associations".  An association is a 
set of cluster, partition, allocation account, and user.  I think most sites do 
the accounting so that it is a single limit applied to all partitions, etc. but 
you can use sacctmgr to apply limits at any association level.  Normally you 
would set GrpTRESmin at the required level.  The GrpTRESmin values apply at the 
association you set them and on all child associations.

So while most sites would do something like e.g.
set GrpTRESmin=cpu=N for allocation acct Acct1
thereby allowing members of Acct1 to use (as a group) N cpu-minutes combined 
across all partitions, you could also do something like
set GrpTRESmin=cpu=N for allocation Acct1 and
set GrpTRESmin=cpu=A for allocation Acct1 and partitionA
set GrpTRESmin=cpu=B for allocation Acct1 and partitionB
In this scenario, users of Acct1 can use at most A cpu-min on partitionA and B 
on paritionB, subject to combined usage on all partitions (A, B, and anything 
else) does not exceed N.

Underneath the covers, Slurm and PBS accounting behave a bit differently --- 
IIRC in PBS you assign "credits" to accounts which then get debited as jobs 
run.  In Slurm, each association tracks usage as jobs run, and you can 
configure limits on the usage at various levels.

The tools for reporting usage of allocation accounts in Slurm leave something 
to be desired; sshare is the underlying tool but not very user friendly, and I 
find sbank leaves a lot to be desired.
I have some Perl libraries interfacing with sshare, etc. on CPAN 
(https://metacpan.org/pod/Slurm::Sshare)
 which include a basic sbalance command script.  You would likely need to 
modify the script for your situation (it assumes a situation more like the 
first example above), but that should not be too bad.
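A concrete (hypothetical) version of the simple case, with acct1 and mycluster as 
placeholder names; the monthly reset described in the original question is usually handled 
either by PriorityUsageResetPeriod=MONTHLY in slurm.conf or by zeroing RawUsage from cron, 
which is worth verifying against the resource_limits documentation:

# cap the account at 100,000 CPU-minutes of accrued usage
sacctmgr modify account where name=acct1 cluster=mycluster set GrpTRESMins=cpu=100000

# inspect the resulting association limits
sacctmgr show assoc account=acct1 format=Cluster,Account,User,Partition,GrpTRESMins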



On Wed, Dec 12, 2018 at 1:58 PM Merritt, Todd R - (tmerritt) 
mailto:tmerr...@email.arizona.edu>> wrote:
Hi all,
I'm new to slurm. I've used PBS extensively and have set up an 
accounting system that gives groups/account a fixed number of hours per month 
on a per queue/partition basis. It decrements that time allocation with every 
job run and then resets it to the original value at the start of the next 
month. I had hoped that slurm would do this natively but it doesn't seem like 
it does. I came across sbank which sounds like it would implement this but it 
also seems like it would span partitions and not allow separate limits per 
partition. Is this something that has already been implemented or could be done 
in an easier way than what I'm trying?

Thanks,
Todd


--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads    paye...@umd.edu
5825 University Research Park   (301) 405-6135
University of Maryland
College Park, MD 20740-3831


[slurm-users] Enforcing relative resource restrictions in submission script

2024-02-27 Thread Matthew R. Baney via slurm-users
Hello Slurm users,

I'm trying to write a check in our job_submit.lua script that enforces
relative resource requirements such as disallowing more than 4 CPUs or 48GB
of memory per GPU. The QOS itself has a MaxTRESPerJob of
cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to
prevent jobs from "stranding" GPUs, e.g., a 32 CPU/384GB memory job with
only 1 GPU.

I might be missing something obvious, but the rabbit hole I'm going down at
the moment is trying to check all of the different ways job arguments could
be set in the job descriptor.

i.e., the following should all be disallowed:

srun --gres=gpu:1 --mem=49G ... (tres_per_node, mem_per_node set in the
descriptor)

srun --gpus=1 --mem-per-gpu=49G ... (tres_per_job, mem_per_tres)

srun --gres=gpu:1 --ntasks-per-gpu=5 ... (tres_per_node, num_tasks,
ntasks_per_tres)

srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ... (tres_per_job, num_tasks,
mem_per_cpu)

...

Essentially what I'm looking for is a way to access the ReqTRES string from
the job record before it exists, and then run some logic against that i.e.,
if (CPU count / GPU count) > 4 or (mem count / GPU count) > 48G, error out.

Is something like this possible?

Thanks,
Matthew

-- 
Matthew Baney
Assistant Director of Computational Systems
mba...@umd.edu | (301) 405-6756
University of Maryland Institute for Advanced Computer Studies
3154 Brendan Iribe Center
8125 Paint Branch Dr.
College Park, MD 20742

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison

The error message indicates that there are no resources to execute jobs.   
Since you haven’t defined any compute nodes you will get this error.

I would suggest that you create at least one compute node.  Once, you do that 
this error should go away.

Jeff

From: Alison Peterson via slurm-users 
Sent: Tuesday, April 9, 2024 2:52 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Nodes required for job are down, drained or reserved


Hi everyone, I'm conducting some tests. I've just set up SLURM on the head node 
and haven't added any compute nodes yet. I'm trying to test it to ensure it's 
working, but I'm encountering an error: 'Nodes required for the job are DOWN, 
DRAINED, or reserved for jobs in higher priority partitions.

Any guidance will be appreciated thank you!

--
Alison Peterson
IT Research Support Analyst
Information Technology
apeters...@sdsu.edu
O: 619-594-3364
San Diego State University | SDSU.edu
5500 Campanile Drive | San Diego, CA 92182-8080


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison

Can you provide the output of the following commands:


  *   sinfo
  *   scontrol show node name=head

and the job command that you're trying to run?



From: Alison Peterson 
Sent: Tuesday, April 9, 2024 3:03 PM
To: Jeffrey R. Lang 
Cc: slurm-users@lists.schedmd.com
Subject: Re: [EXT] RE: [slurm-users] Nodes required for job are down, drained 
or reserved

Hi Jeffrey,
 I'm sorry I did add the head node in the compute nodes configuration, this is 
the slurm.conf

# COMPUTE NODES
NodeName=head CPUs=24 RealMemory=184000 Sockets=2  CoresPerSocket=6 
ThreadsPerCore=2 State=UNKNOWN
PartitionName=lab  Nodes=ALL Default=YES MaxTime=INFINITE State=UP 
OverSubscribe=Force


On Tue, Apr 9, 2024 at 12:57 PM Jeffrey R. Lang 
mailto:jrl...@uwyo.edu>> wrote:
Alison

The error message indicates that there are no resources to execute jobs.   
Since you haven’t defined any compute nodes you will get this error.

I would suggest that you create at least one compute node.  Once, you do that 
this error should go away.

Jeff

From: Alison Peterson via slurm-users 
mailto:slurm-users@lists.schedmd.com>>
Sent: Tuesday, April 9, 2024 2:52 PM
To: slurm-users@lists.schedmd.com<mailto:slurm-users@lists.schedmd.com>
Subject: [slurm-users] Nodes required for job are down, drained or reserved


Hi everyone, I'm conducting some tests. I've just set up SLURM on the head node 
and haven't added any compute nodes yet. I'm trying to test it to ensure it's 
working, but I'm encountering an error: 'Nodes required for the job are DOWN, 
DRAINED, or reserved for jobs in higher priority partitions.

Any guidance will be appreciated thank you!

--
Alison Peterson
IT Research Support Analyst
Information Technology
apeters...@sdsu.edu<mailto:mfar...@sdsu.edu>
O: 619-594-3364
San Diego State University | SDSU.edu<http://sdsu.edu/>
5500 Campanile Drive | San Diego, CA 92182-8080



--
Alison Peterson
IT Research Support Analyst
Information Technology
apeters...@sdsu.edu<mailto:mfar...@sdsu.edu>
O: 619-594-3364
San Diego State University | SDSU.edu<http://sdsu.edu/>
5500 Campanile Drive | San Diego, CA 92182-8080


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison

  The sinfo output shows that your head node is down due to some configuration error.

  Are you running slurmd on the head node?  If slurmd is running, find its log 
file and pass along the relevant entries from it.

Can you redo the scontrol command? “node name” should be “nodename”, one word.

I need to see what’s in the test.sh file to get an idea of how your job is 
set up.

jeff
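A short sketch of the checks being asked for here, using the node name "head" from the 
config above (the log path depends on your SlurmdLogFile setting):

systemctl status slurmd                       # is the daemon running on the node at all?
slurmd -D -vvv                                # if the service will not stay up, run it in the foreground with verbose logging
scontrol show node head
tail -n 50 /var/log/slurmd.log                # adjust to your SlurmdLogFile setting
scontrol update nodename=head state=resume    # clear the DOWN state once the cause is fixed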

From: Alison Peterson 
Sent: Tuesday, April 9, 2024 3:15 PM
To: Jeffrey R. Lang 
Cc: slurm-users@lists.schedmd.com
Subject: Re: [EXT] RE: [EXT] RE: [slurm-users] Nodes required for job are down, 
drained or reserved

Yes! here is the information:

[stsadmin@head ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
lab* up   infinite  1  down* head

[stsadmin@head ~]$ scontrol show node name=head
Node name=head not found

[stsadmin@head ~]$ sbatch ~/Downloads/test.sh
Submitted batch job 7

[stsadmin@head ~]$ squeue
 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)
 7   lab test_slu stsadmin PD   0:00  1 
(ReqNodeNotAvail, UnavailableNodes:head)



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison

  In your case, since you are using head as both a Slurm management node and a
compute node, you’ll need to set up slurmd on the head node.

Once slurmd is running, use “sinfo” to see what the status of the node is.
Most likely it will be down, hopefully without an asterisk. If that’s the case, then use

scontrol update nodename=head state=resume

and then check the status again. Hopefully the node will show idle, meaning
that it should be ready to accept jobs.
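
A minimal sequence on the head node might look like this (assuming a
systemd-based install with the slurmd package already present; adjust
service and package names to your distribution):

sudo systemctl enable --now slurmd                 # start slurmd on the head node
sudo systemctl restart slurmctld                   # let the controller see the node register
sinfo                                              # node should no longer show down*
sudo scontrol update nodename=head state=resume    # clear the DOWN state if it persists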


Jeff

From: Alison Peterson 
Sent: Tuesday, April 9, 2024 3:40 PM
To: Jeffrey R. Lang 
Cc: slurm-users@lists.schedmd.com
Subject: Re: [EXT] RE: [EXT] RE: [EXT] RE: [slurm-users] Nodes required for job 
are down, drained or reserved

Aha! That is probably the issue: slurmd! I know slurmd runs on the compute
nodes. I need to deploy this for a lab, but I only have one of the servers with
me. I will be adding them one by one after the first one is set up, so as not to
disrupt their current setup. I want to be able to use the resources from the
head node and also the compute nodes once it's completed.

[stsadmin@head ~]$ sudo systemctl status slurmd
Unit slurmd.service could not be found.

[stsadmin@head ~]$ scontrol show node head
NodeName=head CoresPerSocket=6
   CPUAlloc=0 CPUEfctv=24 CPUTot=24 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=head NodeHostName=head
   RealMemory=184000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A 
MCS_label=N/A
   Partitions=lab
   BootTime=None SlurmdStartTime=None
   LastBusyTime=2024-04-09T13:20:04 ResumeAfterTime=None
   CfgTRES=cpu=24,mem=184000M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Not responding [slurm@2024-04-09T10:14:10]

[stsadmin@head ~]$ cat ~/Downloads/test.sh
#!/bin/bash
#SBATCH --job-name=test_slurm
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --output=test_slurm_output.txt

echo "Starting the SLURM test job on: $(date)"
echo "Running on hostname: $(hostname)"
echo "SLURM_JOB_ID: $SLURM_JOB_ID"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "SLURM_NTASKS: $SLURM_NTASKS"

# Here you can place the commands you want to run on the compute node
# For example, a simple sleep command or any application that needs to be tested
sleep 60

echo "SLURM test job completed on: $(date)"


[slurm-users] Re: [EXT] RE: [EXT] RE: [EXT] RE: [EXT] RE: Nodes required for job are down, drained or reserved

2024-04-09 Thread Jeffrey R. Lang via slurm-users
Alison

  I’m glad I was able to help.  Good luck.

Jeff

From: Alison Peterson 
Sent: Tuesday, April 9, 2024 4:09 PM
To: Jeffrey R. Lang 
Cc: slurm-users@lists.schedmd.com
Subject: Re: [EXT] RE: [EXT] RE: [EXT] RE: [EXT] RE: [slurm-users] Nodes 
required for job are down, drained or reserved

Thank you so much!!! I have installed slurmd on the head node, started and
enabled the service, and restarted slurmctld. I sent two jobs and they are running!

[stsadmin@head ~]$ squeue
 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)
10   lab test_slu stsadmin  R   0:01  1 head
 9   lab test_slu stsadmin  R   0:09  1 head
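
As a final check, the earlier command should confirm the node is healthy:

scontrol show node head

(State should now report something like MIXED or ALLOCATED while jobs are
running, rather than DOWN.)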


[slurm-users] Optimizing CPU socket affinities and NVLink

2024-08-08 Thread Matthew R. Baney via slurm-users
Hello,

I've recently adopted setting AutoDetect=nvml in our GPU nodes' gres.conf
files to automatically populate Cores and Links for GPUs, which has been
working well.
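
For reference, the relevant configuration is minimal (the node name, GPU type,
and counts below are placeholders rather than our real values):

# gres.conf on each GPU node
AutoDetect=nvml

# slurm.conf
GresTypes=gpu
NodeName=gpu-node01 Gres=gpu:a6000:4 CPUs=64 RealMemory=256000 State=UNKNOWN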

I'm now wondering if I can prioritize having single GPU jobs scheduled on
NVLink pairs (these are PCIe A6000s) where one of the GPUs in the pair is
already running a single GPU job, assuming the CPU socket with affinity has
enough cores to handle the job. We have some users wanting to run single
GPU jobs and others wanting to run dual GPU jobs, both on the same nodes,
so we would prefer not to configure each NVLink pair as a single GRES, for
better job throughput.

As is, I've observed that for a node with at least 4 GPUs and 2 sockets
(one NVLink pair per socket), Slurm will prioritize evening out core
allocation between the sockets. Once the second single GPU job is
submitted, one GPU in each NVLink pair is taken up and a subsequent dual
GPU job can still run, but doesn't have access to an NVLink pair.

We've also got a few nodes where single GPUs have failed, resulting in some
NVLink'd pairs and usually a single non-NVLink'd GPU (3 or 7 total GPUs).
It'd be ideal if single GPU jobs also got prioritized for scheduling on the
non-NVLink'd GPU in this case.

Is this possible?

All the best,
Matthew

-- 
Matthew Baney
Assistant Director of Computational Systems
mba...@umd.edu | (301) 405-6756
University of Maryland Institute for Advanced Computer Studies
3154 Brendan Iribe Center
8125 Paint Branch Dr.
College Park, MD 20742

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Node configuration unavailable when using --mem-per-gpu , for specific GPU type

2024-12-13 Thread Matthew R. Baney via slurm-users
Hi all,

I'm seeing some odd behavior when using the --mem-per-gpu flag instead of
the --mem flag to request memory when also requesting all available CPUs on
a node (in this example, all available nodes have 32 CPUs):

$ srun --ntasks-per-node=8 --cpus-per-task=4 --gpus-per-node=gtx1080ti:1
--mem-per-gpu=1g --pty bash
srun: error: Unable to allocate resources: Requested node configuration is
not available

$ srun --ntasks-per-node=8 --cpus-per-task=4 --gpus-per-node=gtx1080ti:1
--mem=1g --pty bash
srun: job 3479971 queued and waiting for resources
srun: job 3479971 has been allocated resources
$

The nodes in this partition have a mix of gtx1080ti and rtx2080ti GPUs, but
only one type of GPU is in any one node. The same behavior does not occur
when requesting a (node with a) rtx2080ti instead.
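
In case it helps with diagnosis, the two node types can be compared with
something like the following (node names here are placeholders):

scontrol show node gtx1080ti-node | grep -E 'CfgTRES|Gres'
scontrol show node rtx2080ti-node | grep -E 'CfgTRES|Gres'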

Is there something I'm missing that would cause the --mem-per-gpu flag to
not be working in this example?

Thanks,
Matthew

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm not running on a warewulf node

2024-12-03 Thread Jeffrey R. Lang via slurm-users
Steve

  Try running the failing process from the command line and use the -D option.

Per the man page: "Run slurmd in the foreground. Error and debug messages will be
copied to stderr."
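
For example (the extra -v flags just increase verbosity):

sudo slurmd -D -vvv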

Jeffrey R. Lang
Advanced Research Computing Center
University of Wyoming, Information Technology Center
1000 E. University Ave
Laramie,  WY 82071

Email: jrl...@uwyo.edu
Work: 307.766.3381

From: Steven Jones via slurm-users 
Sent: Tuesday, December 3, 2024 5:39 PM
To: slurm-us...@schedmd.com
Subject: [slurm-users] slurm not running on a warewulf node


Hi,

I have set the log file location in slurm.conf as:

SlurmdLogFile=/var/log/slurm/slurmd.log

But it is 0 length.

Slurm will not run; what else do I need to do to log why it's failing, please?



regards

Steven

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com