Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread David
I would think that slurm would only filter it out, potentially, if the
partition in question (b4) was marked as "hidden" and only accessible by
the correct account.

On Thu, Sep 21, 2023 at 3:11 AM Diego Zuccato 
wrote:

> Hello all.
>
> We have one partition (b4) that's reserved for an account while the
> others are "free for all".
> The problem is that
> sbatch --partition=b1,b2,b3,b4,b5 test.sh
> fails with
> sbatch: error: Batch job submission failed: Invalid account or
> account/partition combination specified
> while
> sbatch --partition=b1,b2,b3,b5 test.sh
> succeeds.
>
> Shouldn't Slurm (22.05.6) just "filter out" the inaccessible partition,
> considering only the others?
> Just like what it does if I'm requesting more cores than available on a
> node.
>
> I'd really like to avoid having to replicate scheduler logic in
> job_submit.lua... :)
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
>
>

-- 
David Rhey
---
Advanced Research Computing
University of Michigan


Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread David
Slurm is working as it should. Your own examples prove that: when you don't
submit to b4, the job works. However, looking at man sbatch:

   -p, --partition=
          Request a specific partition for the resource allocation. If not
          specified, the default behavior is to allow the slurm controller
          to select the default partition as designated by the system
          administrator. If the job can use more than one partition,
          specify their names in a comma separate list and the one offering
          earliest initiation will be used with no regard given to the
          partition name ordering (although higher priority partitions will
          be considered first). When the job is initiated, the name of the
          partition used will be placed first in the job record partition
          string.

In your example, the job can NOT use more than one partition (given the
restrictions defined on the partition itself precluding certain accounts
from using it). This, to me, seems like either a user education issue (i.e.,
don't have them submit to every partition), or you can try the
job_submit.lua route - or perhaps the hidden partition route (which I've not
tested).
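
For reference, the sort of partition definition I have in mind for the hidden
route is roughly this (a sketch only - the node list and account name are
placeholders, and as noted I've not tested it):

PartitionName=b4 Nodes=... AllowAccounts=someacct Hidden=YES State=UP

Hidden partitions are left out of sinfo/squeue output unless you ask for them
with -a/--all; whether that also changes how a comma-separated --partition
list is validated at submit time is exactly the part I haven't verified.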

On Thu, Sep 21, 2023 at 9:18 AM Diego Zuccato 
wrote:

> Uh? It's not a problem if other users see there are jobs in the
> partition (IIUC it's what 'hidden' is for), even if they can't use it.
>
> The problem is that if it's included in --partition it prevents jobs
> from being queued!
> Nothing  in the documentation about --partition made me think that
> forbidding access to one partition would make a job unqueueable...
>
> Diego
>
> Il 21/09/2023 14:41, David ha scritto:
> > I would think that slurm would only filter it out, potentially, if the
> > partition in question (b4) was marked as "hidden" and only accessible by
> > the correct account.
> >
> > On Thu, Sep 21, 2023 at 3:11 AM Diego Zuccato <diego.zucc...@unibo.it> wrote:
> >
> > Hello all.
> >
> > We have one partition (b4) that's reserved for an account while the
> > others are "free for all".
> > The problem is that
> > sbatch --partition=b1,b2,b3,b4,b5 test.sh
> > fails with
> > sbatch: error: Batch job submission failed: Invalid account or
> > account/partition combination specified
> > while
> > sbatch --partition=b1,b2,b3,b5 test.sh
> > succeeds.
> >
> > Shouldn't Slurm (22.05.6) just "filter out" the inaccessible
> partition,
> > considering only the others?
> > Just like what it does if I'm requesting more cores than available
> on a
> > node.
> >
> > I'd really like to avoid having to replicate scheduler logic in
> >     job_submit.lua... :)
> >
> > --
> > Diego Zuccato
> > DIFA - Dip. di Fisica e Astronomia
> > Servizi Informatici
> > Alma Mater Studiorum - Università di Bologna
> > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> > tel.: +39 051 20 95786
> >
> >
> >
> > --
> > David Rhey
> > ---
> > Advanced Research Computing
> > University of Michigan
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
>
>

-- 
David Rhey
---
Advanced Research Computing
University of Michigan


Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread David
> That's not at all how I interpreted this man page description.  By "If the
> job can use more than..." I thought it was completely obvious (although
> perhaps wrong, if your interpretation is correct, but it never crossed my
> mind) that it referred to whether the _submitting user_ is OK with it using
> more than one partition. The partition where the user is forbidden (because
> of the partition's allowed account) should just be _not_ the earliest
> initiation (because it'll never initiate there), and therefore not run
> there, but still be able to run on the other partitions listed in the batch
> script.

That's fair. I was considering this only given the fact that we know the
user doesn't have access to a partition (this isn't the surprise here) and
that Slurm communicates that as the reason pretty clearly. I can see how, if
a user is submitting against multiple partitions, they might hope that if a
job couldn't run in a given partition, given the number of others provided,
the scheduler would consider all of those *before* dying outright at the
first rejection.

On Thu, Sep 21, 2023 at 10:28 AM Bernstein, Noam CIV USN NRL (6393)
Washington DC (USA)  wrote:

> On Sep 21, 2023, at 9:46 AM, David  wrote:
>
> Slurm is working as it should. From your own examples you proved that; by
> not submitting to b4 the job works. However, looking at man sbatch:
>
>    -p, --partition=
>           Request a specific partition for the resource allocation. If
>           not specified, the default behavior is to allow the slurm
>           controller to select the default partition as designated by the
>           system administrator. If the job can use more than one
>           partition, specify their names in a comma separate list and the
>           one offering earliest initiation will be used with no regard
>           given to the partition name ordering (although higher priority
>           partitions will be considered first). When the job is
>           initiated, the name of the partition used will be placed first
>           in the job record partition string.
>
> In your example, the job can NOT use more than one partition (given the
> restrictions defined on the partition itself precluding certain accounts
> from using it). This, to me, seems either like a user education issue (i.e.
> don't have them submit to every partition), or you can try the job submit
> lua route - or perhaps the hidden partition route (which I've not tested).
>
>
> That's not at all how I interpreted this man page description.  By "If the
> job can use more than..." I thought it was completely obvious (although
> perhaps wrong, if your interpretation is correct, but it never crossed my
> mind) that it referred to whether the _submitting user_ is OK with it using
> more than one partition. The partition where the user is forbidden (because
> of the partition's allowed account) should just be _not_ the earliest
> initiation (because it'll never initiate there), and therefore not run
> there, but still be able to run on the other partitions listed in the batch
> script.
>
> I think it's completely counter-intuitive that submitting saying it's OK
> to run on one of a few partitions, and one partition happening to be
> forbidden to the submitting user, means that it won't run at all.  What if
> you list multiple partitions, and increase the number of nodes so that
> there aren't enough in one of the partitions, but not realize this
> problem?  Would you expect that to prevent the job from ever running on any
> partition?
>
> Noam
>


-- 
David Rhey
---
Advanced Research Computing
University of Michigan


Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-28 Thread David
A colleague of mine has it scripted out quite well, so I can't speak to
*all* of the details. However, we have a dedicated user that we submit the
upgrade jobs as, and those jobs carry out the upgrade steps (yum, dnf,
etc.). The jobs are wholenode/exclusive so nothing else can run there, and
then a few other steps might be taken (node reboots etc.). I think we might
have some level of reservation in there so nodes can drain (which would help
expedite the situation a bit, but it still would depend on your longest
running job).

This has worked well for point (".") releases/patches and effectively
behaves like a rolling upgrade. Yours might even be easier/quicker since
it's symlinks (which is SchedMD's preferred method, iirc). Speaking of
which, I believe one of the SchedMD folks gave some pointers on that in the
past, perhaps in a presentation at SLUG, so you could peruse there as well.
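
As a rough sketch, the per-node portion boils down to something like the
following (node and package names are placeholders, and our wrapper does more
than this):

# keep new jobs off the node and wait for running work to finish
scontrol update NodeName=node001 State=DRAIN Reason="slurm patch upgrade"
# update the slurm packages (or repoint the symlink to the new build)
yum update -y slurm slurm-slurmd
# restart the daemon and return the node to service
systemctl restart slurmd
scontrol update NodeName=node001 State=RESUME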



On Thu, Sep 28, 2023 at 12:04 PM Groner, Rob  wrote:

>
> There are 14 steps to upgrading slurm listed on their website, including
> shutting down and backing up the database.  So far we've only updated slurm
> during a downtime, and it's been a major version change, so we've taken all
> the steps indicated.
>
> We now want to upgrade from 23.02.4 to 23.02.5.
>
> Our slurm builds end up in version named directories, and we tell
> production which one to use via symlink.  Changing the symlink will
> automatically change it on our slurm controller node and all slurmd nodes.
>
> Is there an expedited, simple, slimmed down upgrade path to follow if
> we're looking at just a . level upgrade?
>
> Rob
>
>

-- 
David Rhey
---
Advanced Research Computing
University of Michigan


Re: [slurm-users] TRES sreport per association

2023-11-16 Thread David
Hello,

Perhaps `scontrol show assoc` might be able to help you here, in part? Or
even sshare. Those would be the raw usage numbers, if I remember correctly,
but they might give you some insight into usage (though not analogous to
what sreport would show). As a note: `scontrol show assoc` produces very
lengthy output.
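
For example, something along these lines (using the account/user from your
output; check the sshare man page for the exact flag and field names in your
version):

sshare -l -m -A staff -u kmwil -o Account,User,Partition,GrpTRESMins,GrpTRESRaw,RawUsage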

HTH,

David

On Sun, Nov 12, 2023 at 6:03 PM Kamil Wilczek  wrote:

> Dear All,
>
> Is it possible to report GPU Minutes per association? Suppose
> I have two associations like this:
>
>sacctmgr show assoc where user=$(whoami)
> format=account%10,user%16,partition%12,qos%12,grptresmins%20
>
>    Account             User    Partition          QOS          GrpTRESMins
> ---------- ---------------- ------------ ------------ --------------------
>      staff            kmwil      gpu_adv       1gpu1d           gres/gpu=1
>      staff            kmwil       common       4gpu4d         gres/gpu=100
>
> When I run "sreport" I get (I think) the cumulative report. There
> is no "association" option for the "--format" flag for "sreport".
>
> In my setup I divide the cluster using GPU generations. Older
> cards, like TITAN V are accessible for all users (a common
> partition), but, for example, a partition with nodes with A100
> is accessible only for selected users.
>
> Each user gets a QoS ("4gpu4d" means that a user can allocate
> 4 GPUs at most and a single job time limit is 4 days). Each
> user is also limited to a number of GPUMinutes for each
> association and it would be nice to know how many minutes
> are left per assoc.
>
> Kind regards
> --
> Kamil Wilczek [https://keys.openpgp.org/]
> [6C4BE20A90A1DBFB3CBE2947A832BF5A491F9F2A]
>


-- 
David Rhey
---
Advanced Research Computing
University of Michigan


[slurm-users] "command not found"

2017-12-15 Thread david

Hi,

When running an sbatch script I get "command not found".


The command is blast (a widely used bioinformatics tool).


The problem comes from the fact that the blast binary is installed on
the master node but not on the other nodes. When the job runs on
another node the binary is not found.



What would be the way to deal with this situation? What is common practice?


thanks,

david





[slurm-users] External provisioning for accounts and other things (?)

2018-09-18 Thread David Rhey
Hello, All,

First time caller, long-time listener. Does anyone use any sort of external
tool (e.g. a form submission) that generates accounts for their Slurm
environment (notably for new accounts/allocations)? An example of this
would be: a group or user needs us to provision resources for them to run
on and so they submit a form to us with information on their needs and we
provision for them.

If anyone is using external utilities, are they manually putting that in or
are they leveraging Slurm APIs to do this? It's a long shot, but if anyone
is doing this with ServiceNow, I'd be extra interested in how you achieved
that.

Thanks!

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] External provisioning for accounts and other things (?)

2018-09-18 Thread David Rhey
Thanks!! We're currently in a similar boat where things are provisioned in
a form, and we receive an email and act on that information with scripts
and some text expansion. We were wondering whether or not some tighter
integration was possible - but that'd be a feature down the road as we'd
want to be sure the process was predictable.
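
For context, the kind of sacctmgr calls those scripts end up issuing is
roughly along these lines (the account/user names and limits here are made
up):

sacctmgr -i add account newlab Description="new lab allocation" Organization=arc
sacctmgr -i add user uniqname Account=newlab
sacctmgr -i modify account where name=newlab set GrpTRESMins=cpu=1000000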

On Tue, Sep 18, 2018 at 4:04 PM Thomas M. Payerle  wrote:

> We make use of a large home-grown library of Perl scripts for
> creating allocations, creating users, adding users to allocations, etc.
>
> We have a number of "flavors" of allocations, but most allocation
> creation/disabling activity occurs with respect to applications for
> allocations which are reviewed by a faculty committee, and although
> the percentage of applications approved is rather high, it is not automatic
> (and many involve requesting the applicant to elaborate or provide
> additional information).  While we are in the process of migrating the
> "application process" to ServiceNow, it will only be as the web form
> backend and to track the applications, votes/comments of the faculty
> committee, etc.  The actual creation of allocations, etc. is all done
> manually, albeit by simply invoking a single script or two with a handful
> of parameters.  The scripts take care of all the Unixy and Slurm tasks
> required to create the allocation, etc., as well as sending the standard
> welcome email to the allocation
> "owner",  updating local DBs about the new allocation, etc., and keeping a
> log of what was done and why (i.e. linking the action to the specific
> application).  Scripts exist for
> a variety of standard tasks, both high and low level.
>
> A couple of the underlying libraries (Perl wrappers around sacctmgr and
> sshare commands) are available on CPAN (Slurm::Sacctmgr, Slurm::Sshare);
> the rest lack the polish and finish required for publishing on CPAN.
>
> On Tue, Sep 18, 2018 at 3:02 PM David Rhey  wrote:
>
>> Hello, All,
>>
>> First time caller, long-time listener. Does anyone use any sort of
>> external tool (e.g. a form submission) that generates accounts for their
>> Slurm environment (notably for new accounts/allocations)? An example of
>> this would be: a group or user needs us to provision resources for them to
>> run on and so they submit a form to us with information on their needs and
>> we provision for them.
>>
>> If anyone is using external utilities, are they manually putting that in
>> or are they leveraging Slurm APIs to do this? It's a long shot, but if
>> anyone is doing this with ServiceNow, I'd be extra interested in how you
>> achieved that.
>>
>> Thanks!
>>
>> --
>> David Rhey
>> ---
>> Advanced Research Computing - Technology Services
>> University of Michigan
>>
>
>
> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads      paye...@umd.edu
> 5825 University Research Park   (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831
>


-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] External provisioning for accounts and other things (?)

2018-09-19 Thread David Rhey
Thanks! I'll check this out. Y'all are awesome for the responses.

On Wed, Sep 19, 2018 at 7:57 AM Chris Samuel  wrote:

> On Wednesday, 19 September 2018 5:00:58 AM AEST David Rhey wrote:
>
> > First time caller, long-time listener. Does anyone use any sort of
> external
> > tool (e.g. a form submission) that generates accounts for their Slurm
> > environment (notably for new accounts/allocations)? An example of this
> > would be: a group or user needs us to provision resources for them to run
> > on and so they submit a form to us with information on their needs and we
> > provision for them.
>
> The Karaage cluster management software that was originally written by
> folks
> at ${JOB-2} and which we used with Slurm at ${JOB-1} does all this.  I'm
> not
> sure how actively maintained it is (as we have our own system at ${JOB}),
> but
> it's on Github here:
>
> https://github.com/Karaage-Cluster/karaage/
>
> The Python code that handles the Slurm side of things is here:
>
>
> https://github.com/Karaage-Cluster/karaage/blob/master/karaage/datastores/slurm.py
>
> Hope that helps!
>
> All the best,
> Chris
> --
>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>
>
>
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


[slurm-users] Priority access for a group of users

2019-02-15 Thread David Baker
Hello.


We have a small set of compute nodes owned by a group. The group has agreed 
that the rest of the HPC community can use these nodes providing that they (the 
owners) can always have priority access to the nodes. The four nodes are well 
provisioned (1 TByte memory each plus 2 GRID K2 graphics cards) and so there is 
no need to worry about preemption. In fact I'm happy for the nodes to be used 
as well as possible by all users. It's just that jobs from the owners must take 
priority if resources are scarce.


What is the best way to achieve the above in slurm? I'm planning to place the 
nodes in their own partition. The node owners will have priority access to the 
nodes in that partition, but will have no advantage when submitting jobs to the 
public resources. Does anyone please have any ideas how to deal with this?


Best regards,

David



Re: [slurm-users] How to request ONLY one CPU instead of one socket or one node?

2019-02-15 Thread David Rhey
Hello,

Are you sure you're NOT getting 1 CPU when you run your job? You might want
to put some echo logic into your job to look at Slurm env variables of the
node your job lands on as a way of checking. E.g.:

echo $SLURM_CPUS_ON_NODE
echo $SLURM_JOB_CPUS_PER_NODE

I don't see anything wrong with your script. As a test I took the basic
parameters you've outlined and ran an interactive `srun` session,
requesting 1 CPU per task and 4 CPUs per task, and then looked at the
aforementioned variable output within each session. For example, requesting
1 CPU per task:

[drhey@beta-login ~]$ srun --cpus-per-task=1 --ntasks-per-node=1
--partition=standard --mem=1G --pty bash
[drhey@bn19 ~]$ echo $SLURM_CPUS_ON_NODE
1

And again, running this command now asking for 4 CPUs per task and then
echoing the env var:

[drhey@beta-login ~]$ srun --cpus-per-task=4 --ntasks-per-node=1
--partition=standard --mem=1G --pty bash
[drhey@bn19 ~]$ echo $SLURM_CPUS_ON_NODE
4

HTH!

David

On Wed, Feb 13, 2019 at 9:24 PM Wang, Liaoyuan  wrote:

> Dear there,
>
>
>
> I wrote an analytic program to analyze my data. The analysis takes around
> twenty days to analyze all data for one species. When I submit my job to
> the cluster, it always requests one node instead of one CPU. I am wondering
> how I can ONLY request one CPU using the “sbatch” command? Below is my batch
> file. Any comments and help would be highly appreciated.
>
>
>
> Appreciatively,
>
> Leon
>
> 
>
> #!/bin/sh
>
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=1
> #SBATCH -t 45-00:00:00
> #SBATCH -J 9625%j
> #SBATCH -o 9625.out
> #SBATCH -e 9625.err
>
> /home/scripts/wcnqn.auto.pl
>
> =======
>
> Where wcnqn.auto.pl is my program. 9625 denotes the species number.
>
>
>


-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] Priority access for a group of users

2019-02-15 Thread david baker
Hi Paul, Marcus,

Thank you for your replies. Using partition priority all makes sense. I was
thinking of doing something similar with a set of nodes purchased by
another group. That is, having a private high priority partition and a
lower priority "scavenger" partition for the public. In this case scavenger
jobs will get killed when preempted.

In the present case, I did wonder if it would be possible to do something
with just a single partition -- hence my question. Your replies have
convinced me that two partitions will work -- with preemption leading to
re-queued jobs.

Best regards,
David

On Fri, Feb 15, 2019 at 3:09 PM Paul Edmon  wrote:

> Yup, PriorityTier is what we use to do exactly that here.  That said
> unless you turn on preemption jobs may still pend if there is no space.  We
> run with REQUEUE on which has worked well.
>
>
> -Paul Edmon-
>
>
> On 2/15/19 7:19 AM, Marcus Wagner wrote:
>
> Hi David,
>
> as far as I know, you can use the PriorityTier (partition parameter) to
> achieve this. According to the manpages (if I remember right) jobs from
> higher priority tier partitions have precedence over jobs from lower
> priority tier partitions, without taking the normal fairshare priority into
> consideration.
>
> Best
> Marcus
>
> On 2/15/19 10:07 AM, David Baker wrote:
>
> Hello.
>
>
> We have a small set of compute nodes owned by a group. The group has
> agreed that the rest of the HPC community can use these nodes providing
> that they (the owners) can always have priority access to the nodes. The
> four nodes are well provisioned (1 TByte memory each plus 2 GRID K2
> graphics cards) and so there is no need to worry about preemption. In fact
> I'm happy for the nodes to be used as well as possible by all users. It's
> just that jobs from the owners must take priority if resources are scarce.
>
>
> What is the best way to achieve the above in slurm? I'm planning to place
> the nodes in their own partition. The node owners will have priority access
> to the nodes in that partition, but will have no advantage when submitting
> jobs to the public resources. Does anyone please have any ideas how to deal
> with this?
>
>
> Best regards,
>
> David
>
>
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
>


[slurm-users] Question on billing tres information from sacct, sshare, and scontrol

2019-02-21 Thread David Rhey
Hello,

I have a small vagrant setup I use for prototyping/testing various things.
Right now, it's running Slurm 18.08.4. I am noticing some differences for
the billing TRES in the output of various commands (notably that of sacct,
sshare, and scontrol show assoc).

On a freshly built cluster, therefore with no prior usage data, I run a
basic job to generate some usage data:

[vagrant@head vagrant]$ sshare -n -P -A drhey1 -o GrpTRESRaw
cpu=3,mem=1199,energy=0,node=3,billing=59,fs/disk=0,vmem=0,pages=0
cpu=3,mem=1199,energy=0,node=3,billing=59,fs/disk=0,vmem=0,pages=0

[vagrant@head vagrant]$ sshare -n -P -A drhey1 -o RawUsage
3611
3611

When I look at the same info within sacct I see:

[vagrant@head vagrant]$ sacct -X
--format=User,JobID,Account,AllocTRES%50,AllocGRES,ReqGRES,Elapsed,ExitCode
     User    JobID    Account                                          AllocTRES AllocGRES  ReqGRES    Elapsed ExitCode
--------- -------- ---------- -------------------------------------------------- --------- -------- ---------- --------
  vagrant        2     drhey1                   billing=30,cpu=2,mem=600M,node=2                       00:02:00      0:0

Of note is that the billing TRES shows as being equal to 30 in sacct, but
59 in sshare. Something similar happens in scontrol show assoc:

...
GrpTRESMins=cpu=N(3),mem=N(1199),energy=N(0),node=N(3),billing=N(59),fs/disk=N(0),vmem=N(0),pages=N(0)
...

Can anyone explain the difference in billing TRES value output between the
various commands? I have a couple of theories, and have been looking
through source code to try and understand a bit better. For context, I am
trying to understand what a job costs, and what usage for an account over a
span of say a month costs.
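
For the per-account monthly view I've been trying things like the following
(a sketch - the dates are made up, and the report/TRES names are from the
sreport man page):

sreport -t Hours -T billing cluster AccountUtilizationByUser \
    Accounts=drhey1 Start=2019-01-01 End=2019-02-01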

Any insight is most appreciated!

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] Priority access for a group of users

2019-03-01 Thread david baker
Hello,

Following up on implementing preemption in Slurm. Thank you again for all
the advice. After a short break I've been able to run some basic
experiments. Initially, I have kept things very simple and made the
following changes in my slurm.conf...

# Preemption settings
PreemptType=preempt/partition_prio
PreemptMode=requeue

PartitionName=relgroup nodes=red[465-470] ExclusiveUser=YES
MaxCPUsPerNode=40 DefaultTime=02:00:00 MaxTime=60:00:00 QOS=relgroup
State=UP AllowAccounts=relgroup Priority=10 PreemptMode=off

# Scavenger partition
PartitionName=scavenger nodes=red[465-470] ExclusiveUser=YES
MaxCPUsPerNode=40 DefaultTime=00:15:00 MaxTime=02:00:00 QOS=scavenger
State=UP AllowGroups=jfAccessToIridis5 PreemptMode=requeue

The nodes in the relgroup queue are owned by the General Relativity group
and, of course, they have priority to these nodes. The general population
can scavenge these nodes via the scavenger queue. When I use
"preemptmode=cancel" I'm happy that the relgroup jobs can preempt the
scavenger jobs (and the scavenger jobs are cancelled). When I set the
preempt mode to "requeue" I see that the scavenger jobs are still
cancelled/killed. Have I missed an important configuration change or is it
that lower priority jobs will always be killed and not re-queued?

Could someone please advise me on this issue? Also I'm wondering if I
really understand the "requeue" option. Does that mean re-queued and run
from the beginning, or run from the current state (needing checkpointing)?

Best regards,
David

On Tue, Feb 19, 2019 at 2:15 PM Prentice Bisbal  wrote:

> I just set this up a couple of weeks ago myself. Creating two partitions
> is definitely the way to go. I created one partition, "general" for normal,
> general-access jobs, and another, "interruptible" for general-access jobs
> that can be interrupted, and then set PriorityTier accordingly in my
> slurm.conf file (Node names omitted for clarity/brevity).
>
> PartitionName=general Nodes=... MaxTime=48:00:00 State=Up PriorityTier=10
> QOS=general
> PartitionName=interruptible Nodes=... MaxTime=48:00:00 State=Up
> PriorityTier=1 QOS=interruptible
>
> I then set PreemptMode=Requeue, because I'd rather have jobs requeued than
> suspended. And it's been working great. There are few other settings I had
> to change. The best documentation for all the settings you need to change
> is https://slurm.schedmd.com/preempt.html
>
> Everything has been working exactly as desired and advertised. My users
> who needed the ability to run low-priority, long-running jobs are very
> happy.
>
> The one caveat is that jobs that will be killed and requeued need to
> support checkpoint/restart. So when this becomes a production thing, users
> are going to have to acknowledge that they will only use this partition for
> jobs that have some sort of checkpoint/restart capability.
>
> Prentice
>
> On 2/15/19 11:56 AM, david baker wrote:
>
> Hi Paul, Marcus,
>
> Thank you for your replies. Using partition priority all makes sense. I
> was thinking of doing something similar with a set of nodes purchased by
> another group. That is, having a private high priority partition and a
> lower priority "scavenger" partition for the public. In this case scavenger
> jobs will get killed when preempted.
>
> In the present case, I did wonder if it would be possible to do something
> with just a single partition -- hence my question. Your replies have
> convinced me that two partitions will work -- with preemption leading to
> re-queued jobs.
>
> Best regards,
> David
>
> On Fri, Feb 15, 2019 at 3:09 PM Paul Edmon  wrote:
>
>> Yup, PriorityTier is what we use to do exactly that here.  That said
>> unless you turn on preemption jobs may still pend if there is no space.  We
>> run with REQUEUE on which has worked well.
>>
>>
>> -Paul Edmon-
>>
>>
>> On 2/15/19 7:19 AM, Marcus Wagner wrote:
>>
>> Hi David,
>>
>> as far as I know, you can use the PriorityTier (partition parameter) to
>> achieve this. According to the manpages (if I remember right) jobs from
>> higher priority tier partitions have precedence over jobs from lower
>> priority tier partitions, without taking the normal fairshare priority into
>> consideration.
>>
>> Best
>> Marcus
>>
>> On 2/15/19 10:07 AM, David Baker wrote:
>>
>> Hello.
>>
>>
>> We have a small set of compute nodes owned by a group. The group has
>> agreed that the rest of the HPC community can use these nodes providing
>> that they (the owners) can always have priority access to the nodes. The
>> four nodes are well provisioned (1 TByte memory each plus 2 GR

Re: [slurm-users] Priority access for a group of users

2019-03-04 Thread david baker
Hello,

Thank you for reminding me about the sbatch "--requeue" option. When I
submit test jobs using this option the preemption and subsequent restart of
a job works as expected. I've also played around with "preemptmode=suspend"
and that also works, however I suspect we won't use that on these
"diskless" nodes.

As I note, I can scavenge resources and preempt jobs myself (I am a member
of both the "relgroup" account and the general public). That is...

347104  scavenger  myjob  djb1  PD   0:00  1  (Resources)
347105  relgroup   myjob  djb1   R  17:00  1  red465

On the other hand I do not seem to be able to preempt a job submitted by a
colleague. That is, my colleague submits a job to the scavenger queue, it
starts to run. I then submit a job to the relgroup queue, however that job
fails to preempt my colleague's job and stays in pending status.

Does anyone understand what might be wrong, please?

Best regards,
David

On Fri, Mar 1, 2019 at 2:47 PM Antony Cleave 
wrote:

> I have always assumed that cancel just kills the job whereas requeue will
> cancel and then start from the beginning. I know that requeue does this. I
> never tried cancel.
>
> I'm a fan of the suspend mode myself but that is dependent on users not
> asking for all the ram by default. If you can educate the users then this
> works really well as the low priority job stays in ram in suspended mode
> while the high priority job completes and then the low priority job
> continues from where it stopped. No checkpoints and no killing.
>
> Antony
>
>
>
> On Fri, 1 Mar 2019, 12:23 david baker,  wrote:
>
>> Hello,
>>
>> Following up on implementing preemption in Slurm. Thank you again for all
>> the advice. After a short break I've been able to run some basic
>> experiments. Initially, I have kept things very simple and made the
>> following changes in my slurm.conf...
>>
>> # Preemption settings
>> PreemptType=preempt/partition_prio
>> PreemptMode=requeue
>>
>> PartitionName=relgroup nodes=red[465-470] ExclusiveUser=YES
>> MaxCPUsPerNode=40 DefaultTime=02:00:00 MaxTime=60:00:00 QOS=relgroup
>> State=UP AllowAccounts=relgroup Priority=10 PreemptMode=off
>>
>> # Scavenger partition
>> PartitionName=scavenger nodes=red[465-470] ExclusiveUser=YES
>> MaxCPUsPerNode=40 DefaultTime=00:15:00 MaxTime=02:00:00 QOS=scavenger
>> State=UP AllowGroups=jfAccessToIridis5 PreemptMode=requeue
>>
>> The nodes in the relgroup queue are owned by the General Relativity group
>> and, of course, they have priority to these nodes. The general population
>> can scavenge these nodes via the scavenger queue. When I use
>> "preemptmode=cancel" I'm happy that the relgroup jobs can preempt the
>> scavenger jobs (and the scavenger jobs are cancelled). When I set the
>> preempt mode to "requeue" I see that the scavenger jobs are still
>> cancelled/killed. Have I missed an important configuration change or is it
>> that lower priority jobs will always be killed and not re-queued?
>>
>> Could someone please advise me on this issue? Also I'm wondering if I
>> really understand the "requeue" option. Does that mean re-queued and run
>> from the beginning, or run from the current state (needing checkpointing)?
>>
>> Best regards,
>> David
>>
>> On Tue, Feb 19, 2019 at 2:15 PM Prentice Bisbal  wrote:
>>
>>> I just set this up a couple of weeks ago myself. Creating two partitions
>>> is definitely the way to go. I created one partition, "general" for normal,
>>> general-access jobs, and another, "interruptible" for general-access jobs
>>> that can be interrupted, and then set PriorityTier accordingly in my
>>> slurm.conf file (Node names omitted for clarity/brevity).
>>>
>>> PartitionName=general Nodes=... MaxTime=48:00:00 State=Up
>>> PriorityTier=10 QOS=general
>>> PartitionName=interruptible Nodes=... MaxTime=48:00:00 State=Up
>>> PriorityTier=1 QOS=interruptible
>>>
>>> I then set PreemptMode=Requeue, because I'd rather have jobs requeued
>>> than suspended. And it's been working great. There are few other settings I
>>> had to change. The best documentation for all the settings you need to
>>> change is https://slurm.schedmd.com/preempt.html
>>>
>>> Everything has been working exactly as desired and advertised. My users
>>> who needed the ability to run low-priority, long-running jobs are very
>>> happy.
>>>
>>> The one caveat is that jobs that will be k

[slurm-users] How do I impose a limit the memory requested by a job?

2019-03-12 Thread David Baker
Hello,


I have set up a serial queue to run small jobs in the cluster. Actually, I 
route jobs to this queue using the job_submit.lua script. Any 1 node job using 
up to 20 cpus is routed to this queue, unless a user submits their job with an 
exclusive flag.


The partition is shared and so I defined memory to be a resource. I've set 
default memory/cpu to be 4300 Mbytes. There are 40 cpus installed in the nodes 
and the usable memory is circa 172000 Mbytes -- hence my default mem/cpu.


The compute nodes are defined with RealMemory=19, by the way.


I am curious to understand how I can impose a memory limit on the jobs that are 
submitted to this partition. It doesn't make any sense to request more than the 
total usable memory on the nodes. So could anyone please advise me how to 
ensure that users cannot request more than the usable memory on the nodes.


Best regards,

David


PartitionName=serial nodes=red[460-464] Shared=Yes MaxCPUsPerNode=40 
DefaultTime=02:00:00 MaxTime=60:00:00 QOS=serial 
SelectTypeParameters=CR_Core_Memory DefMemPerCPU=4300 State=UP 
AllowGroups=jfAccessToIridis5 PriorityJobFactor=10 PreemptMode=off




Re: [slurm-users] How do I impose a limit the memory requested by a job?

2019-03-14 Thread david baker
Hello Paul,

Thank you for your advice. That all makes sense. We're running diskless
compute nodes and so the usable memory is less than the total memory. So I
have added a memory check to my job_submit.lua -- see below. I think that
all makes sense.

Best regards,
David

-- Check memory/node is valid
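-- (this sentinel value is what min_mem_per_cpu holds when the job did not
-- specify --mem-per-cpu, so apply the partition default of 4300 Mbytes)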
if job_desc.min_mem_per_cpu == 9223372036854775808 then
  job_desc.min_mem_per_cpu = 4300
end

memory = job_desc.min_mem_per_cpu * job_desc.min_cpus

if memory > 172000 then
  slurm.log_user("You cannot request more than 172000 Mbytes per node")
  slurm.log_user("memory is: %u",memory)
  return slurm.ERROR
end


On Tue, Mar 12, 2019 at 4:48 PM Paul Edmon  wrote:

> Slurm should automatically block or reject jobs that can't run on that
> partition in terms of memory usage for a single node.  So you shouldn't
> need to do anything.  If you need something less than the max memory per
> node then you will need to enforce some limits.  We do this via a jobsubmit
> lua script.  That would be my recommended method.
>
>
> -Paul Edmon-
>
>
> On 3/12/19 12:31 PM, David Baker wrote:
>
> Hello,
>
>
> I have set up a serial queue to run small jobs in the cluster. Actually, I
> route jobs to this queue using the job_submit.lua script. Any 1 node job
> using up to 20 cpus is routed to this queue, unless a user submits
> their job with an exclusive flag.
>
>
> The partition is shared and so I defined memory to be a resource. I've set
> default memory/cpu to be 4300 Mbytes. There are 40 cpus installed in the
> nodes and the usable memory is circa 172000 Mbytes -- hence my default
> mem/cpu.
>
>
> The compute nodes are defined with RealMemory=19, by the way.
>
>
> I am curious to understand how I can impose a memory limit on the jobs
> that are submitted to this partition. It doesn't make any sense to request
> more than the total usable memory on the nodes. So could anyone please
> advise me how to ensure that users cannot request more than the usable
> memory on the nodes.
>
>
> Best regards,
>
> David
>
>
> PartitionName=serial nodes=red[460-464] Shared=Yes MaxCPUsPerNode=40
> DefaultTime=02:00:00 MaxTime=60:00:00 QOS=serial
> SelectTypeParameters=CR_Core_Memory *DefMemPerCPU=4300* State=UP
> AllowGroups=jfAccessToIridis5 PriorityJobFactor=10 PreemptMode=off
>
>
>
>


[slurm-users] Very large job getting starved out

2019-03-21 Thread David Baker
Hello,


I understand that this is not a straightforward question, however I'm
wondering if anyone has any useful ideas, please. Our cluster is busy and the
QOS has limited users to a maximum of 32 compute nodes on the "batch" queue.
Users are making good use of the cluster -- for example one user is running
five 6-node jobs at the moment. On the other hand, a job belonging to another
user has been stalled in the queue for around 7 days. He has made reasonable
use of the cluster and as a result his fairshare component is relatively low.
Having said that, the priority of his job is high -- it is currently one of
the highest priority jobs in the batch partition queue. From sprio...


JOBID PARTITION   PRIORITY   AGE   FAIRSHARE   JOBSIZE   PARTITION   QOS

359323 batch 180292 10  79646547100 
 0


I did think that the PriorityDecayHalfLife was quite high at 14 days and so I 
reduced that to 7 days. For reference I've included the key scheduling settings 
from the cluster below. Does anyone have any thoughts, please?


Best regards,

David


PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod  = 00:05:00
PriorityFavorSmall  = No
PriorityFlags   = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
PriorityMaxAge  = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType= priority/multifactor
PriorityWeightAge   = 10
PriorityWeightFairShare = 100
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS   = 1





Re: [slurm-users] Very large job getting starved out

2019-03-21 Thread David Baker
Hi Cyrus,


Thank you for the links. I've taken a good look through the first link (re the 
cloud cluster) and the only parameter that might be relevant is 
"assoc_limit_stop", but I'm not sure if that is relevant in this instance. The 
reason for the delay of the job in question is "priority", however there are 
quite a lot of jobs from users in the same accounting group with jobs delayed 
due to "QOSMaxCpuPerUserLimit". They also talk about using the "builtin" 
scheduler which I guess would turn off backfill.


I have attached a copy of the current slurm.conf so that you and other members 
can get a better feel for the whole picture. Certainly we see a large number of 
serial/small (1 node) jobs running through the system and I'm concerned that my 
setup encourages this behaviour, however how to stem this issue is a mystery to 
me.


If you or anyone else has any relevant thoughts then please let me know. In 
particular I am keen to understand "assoc_limit_stop" and whether it is a 
relevant option in this situation.


Best regards,

David


From: slurm-users  on behalf of Cyrus 
Proctor 
Sent: 21 March 2019 14:19
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Very large job getting starved out


Hi David,


You might have a look at the thread "Large job starvation on cloud cluster" 
that started on Feb 27; there's some good tidbits in there. Off the top without 
more information, I would venture that settings you have in slurm.conf end up 
backfilling the smaller jobs at the expense of scheduling the larger jobs.


Your partition configs plus accounting and scheduler configs from slurm.conf 
would be helpful.


Also, search for "job starvation" here: 
https://slurm.schedmd.com/sched_config.html
as another potential starting point.


Best,

Cyrus


On 3/21/19 8:55 AM, David Baker wrote:

Hello,


I understand that this is not a straight forward question, however I'm 
wondering if anyone has any useful ideas, please. Our cluster is busy and the 
QOS has limited users to a maximum of 32 compute nodes on the "batch" queue. 
Users are making good of the cluster -- for example one user is running five 6 
node jobs at the moment. On the other hand, a job belonging to another user has 
been stalled in the queue for around 7 days. He has made reasonable use of the 
cluster and as a result his fairshare component is relatively low. Having said 
that, the priority of his job is high -- it currently one of the highest 
priority jobs in the batch partition queue. From sprio...


JOBID PARTITION   PRIORITY   AGE   FAIRSHARE   JOBSIZE   PARTITION   QOS

359323 batch 180292 10  79646547100 
 0


I did think that the PriorityDecayHalfLife was quite high at 14 days and so I 
reduced that to 7 days. For reference I've included the key scheduling settings 
from the cluster below. Does anyone have any thoughts, please?


Best regards,

David


PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod  = 00:05:00
PriorityFavorSmall  = No
PriorityFlags   = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
PriorityMaxAge  = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType= priority/multifactor
PriorityWeightAge   = 10
PriorityWeightFairShare = 100
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS   = 1





slurm.conf
Description: slurm.conf


Re: [slurm-users] Very large job getting starved out

2019-03-22 Thread David Baker
Hello,


Running the command "squeue -j 359323 --start" gives me the following output...


 JOBID PARTITION  NAME USER ST  START_TIME  NODES SCHEDNODES  NODELIST(REASON)
359323     batch batch  jwk PD         N/A     27     (null)  (Resources)

Best regards,
David


From: slurm-users  on behalf of 
Christopher Samuel 
Sent: 21 March 2019 17:54
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Very large job getting starved out

On 3/21/19 6:55 AM, David Baker wrote:

> it currently one of the highest priority jobs in the batch partition queue

What does squeue -j 359323 --start say?

--
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



[slurm-users] Backfill advice

2019-03-23 Thread david baker
Hello,

We do have large jobs getting starved out on our cluster, and I note
particularly that we never manage to see a job getting assigned a start
time. It seems very possible that backfilled jobs are stealing nodes
reserved for large/higher priority jobs.

I'm wondering if our backfill configuration has any bearing on this issue
or whether we are unfortunate enough to have hit a bug. One parameter that
is missing in our bf setup is "bf_continue". Is that parameter significant
in terms of ensuring that bf drills down sufficiently in the job mix? Also
we are using the default bf frequency -- should we really reduce the
frequency and potentially reduce the number of bf jobs per group/user or
total at each iteration? Currently, I think we are setting the per/user
limit to 20.

Any thoughts would be appreciated, please.

Best regards,
David


Re: [slurm-users] Backfill advice

2019-03-25 Thread David Baker
Hello Doug,


Thank you for your detailed reply regarding how to set up backfill. There's 
quite a lot to take in there. Fortunately, I now have a day or two to read up 
and understand the ideas now that our cluster is down due to a water cooling 
failure. In the first instance, I'll certainly implement bf_continue and 
review/amend the "bf_maxjobs" and "bf_interval" parameters. Switching on 
backfill debugging sounds very useful, but does that setting tend to bloat the 
logs if left enabled for long periods?
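
Something along these lines is what I have in mind - the values are only
starting points rather than recommendations, the leading "..." stands for our
existing parameters, and bf_max_job_test/bf_max_job_user appear to be the
current names for what was referred to as "bf_maxjobs":

SchedulerParameters=...,bf_continue,bf_interval=60,bf_max_job_test=1000,bf_max_job_user=20
DebugFlags=Backfill

(or, to toggle the debugging at run time: scontrol setdebugflags +backfill)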


We did have a contract with SchedMD which recently finished. In one of the last 
discussions we had it was intimated that we may have hit a bug -- bug 5297 -- 
in that backfilled jobs were potentially stealing nodes intended for higher 
priority jobs. The advice was to consider upgrading to Slurm 18.08.4 and 
implementing bf_ignore_newly_avail_nodes. I was interested to see that you had 
a similar discussion with SchedMD and did upgrade. I think I ought to update 
the bf configuration re my first paragraph and see how that goes before we 
bite the bullet and do the upgrade (we are at 18.08.0 currently).


Best regards,

David


From: slurm-users  on behalf of Douglas 
Jacobsen 
Sent: 23 March 2019 13:30
To: Slurm User Community List
Subject: Re: [slurm-users] Backfill advice

Hello,

At first blush bf_continue and bf_interval as well as bf_maxjobs (if I 
remembered the parameter correctly) are critical first steps in tuning.  
Setting DebugFlags=backfill is essential to getting the needed data to make 
tuning decisions.

Use of per user/account settings if they are too low can also cause starvation 
depending on the way your priority calculation is set up.

I presented these slides a few years ago at the Slurm user group on this topic:
https://slurm.schedmd.com/SLUG16/NERSC.pdf

The key thing to keep in mind with large jobs is that slurm needs to evaluate 
them again and again in the same order or the scheduled time may drift.  Thus 
it is important that once jobs are getting planning reservations they must 
continue to do so.

Because of the prevalence of large jobs at our site we use  bf_min_prio_resv 
which splits the priority space into a reserving and non-reserving set, and 
then use job age to allow jobs to age from the non reserving portion of the 
priority space to the reservation portion.  Use of the recent 
MaxJobsAccruePerUser limits on a job qos can throttle the rate of jobs aging 
and prevent negative effects from users submitting large numbers of jobs.

I realize that is a large number of tunables and concepts densely packed, but 
it should give you some reasonable starting points.

Doug


On Sat, Mar 23, 2019 at 05:26 david baker <djbake...@gmail.com> wrote:
Hello,

We do have large jobs getting starved out on our cluster, and I note 
particularly that we never manage to see a job getting assigned a start time. 
It seems very possible that backfilled jobs are stealing nodes reserved for 
large/higher priority jobs.

I'm wondering if our backfill configuration has any bearing on this issue or 
whether we are unfortunate enough to have hit a bug. One parameter that is 
missing in our bf setup is "bf_continue". Is that parameter significant in 
terms of ensuring that bf drills down sufficiently in the job mix? Also we are 
using the default bf frequency -- should we really reduce the frequency and 
potentially reduce the number of bf jobs per group/user or total at each 
iteration? Currently, I think we are setting the per/user limit to 20.

Any thoughts would be appreciated, please.

Best regards,
David
--
Sent from Gmail Mobile


[slurm-users] Slurm users meeting 2019?

2019-03-25 Thread david baker


Hello, 

I was searching the web to see if there was going to be a Slurm users’ meeting 
this year, but couldn’t find anything. Does anyone know if there is a users’ 
meeting planned for 2019? If so, is it most likely going to be held as part of 
Supercomputing in Denver? Please let me know if you know what’s planned this 
year.

Best regards,
David

Sent from my iPad


Re: [slurm-users] Slurm users meeting 2019?

2019-03-27 Thread David Baker
Thank you for the date and location of this year's Slurm User Group Meeting.


Best regards,

David


From: slurm-users  on behalf of Jacob 
Jenson 
Sent: 25 March 2019 21:26:45
To: Slurm User Community List
Subject: Re: [slurm-users] Slurm users meeting 2019?

The 2019 Slurm User Group Meeting will be held  in Salt Lake City at the 
University of Utah on September 17-18.

Registration for this user group meeting typically opens in May.

Jacob


On Mon, Mar 25, 2019 at 2:57 PM david baker <djbake...@gmail.com> wrote:

Hello,

I was searching the web to see if there was going to be a Slurm users’ meeting 
this year, but couldn’t find anything. Does anyone know if there is a users’ 
meeting planned for 2019? If so, is it most likely going to be held as part of 
Supercomputing in Denver? Please let me know if you know what’s planned this 
year.

Best regards,
David

Sent from my iPad


[slurm-users] Effect of PriorityMaxAge on job throughput

2019-04-09 Thread David Baker
Hello,

I've finally got the job throughput/turnaround to be reasonable in our cluster. 
Most of the time the job activity on the cluster sets the default QOS to 32 
nodes (there are 464 nodes in the default queue). Jobs requesting nodes close 
to the QOS level (for example 22 nodes) are scheduled within 24 hours which is 
better than it has been. Still I suspect there is room for improvement. I note 
that these large jobs still struggle to be given a starttime, however many jobs 
are now being given a starttime following my SchedulerParameters makeover.

I used advice from the mailing list and the Slurm high throughput document to 
help me make changes to the scheduling parameters. They are now...

SchedulerParameters=assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,bf_yield_interval=100,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=200

Also..
PriorityFavorSmall=NO
PriorityFlags=SMALL_RELATIVE_TO_TIME,ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityMaxAge=1-0

The most significant change was actually reducing "PriorityMaxAge" from 7-0 to 
1-0. Before that change the larger jobs could hang around in the queue for 
days. Does it make sense therefore to further reduce PriorityMaxAge to less 
than 1 day? Your advice would be appreciated, please.

Best regards,
David





Re: [slurm-users] Effect of PriorityMaxAge on job throughput

2019-04-10 Thread David Baker
Michael,

Thank you for your reply and your thoughts. These are the priority weights that 
I have configured in the slurm.conf.

PriorityWeightFairshare=100
PriorityWeightAge=10
PriorityWeightPartition=1000
PriorityWeightJobSize=1000
PriorityWeightQOS=1

I've made the PWJobSize the highest factor, however I understand that that 
only provides a once-off kick to jobs and so it is probably insignificant in 
the longer run. That's followed by the PWFairshare.

Should I really be looking at increasing the PWAge factor to help to "push 
jobs" through the system?

The other issue that might play a part is that we see a lot of single node jobs 
(presumably backfilled) going into the system. Users aren't excessively bombing 
the cluster, but maybe some backfill throttling would be useful as well (?)

What are your thoughts having seen the priority factors, please? I've attached 
a copy of the slurm.conf just in case you or anyone else wants to take a more 
complete overview.

Best regards,
David


From: slurm-users  on behalf of Michael 
Gutteridge 
Sent: 09 April 2019 18:59
To: Slurm User Community List
Subject: Re: [slurm-users] Effect of PriorityMaxAge on job throughput


It might be useful to include the various priority factors you've got 
configured.  The fact that adjusting PriorityMaxAge had a dramatic effect 
suggests that the age factor is pretty high- might be worth looking at that 
value relative to the other factors.

Have you looked at PriorityWeightJobSize?  Might have some utility if you're 
finding large jobs getting short-shrift.

 - Michael


On Tue, Apr 9, 2019 at 2:01 AM David Baker <d.j.ba...@soton.ac.uk> wrote:
Hello,

I've finally got the job throughput/turnaround to be reasonable in our cluster. 
Most of the time the job activity on the cluster sets the default QOS to 32 
nodes (there are 464 nodes in the default queue). Jobs requesting nodes close 
to the QOS level (for example 22 nodes) are scheduled within 24 hours which is 
better than it has been. Still I suspect there is room for improvement. I note 
that these large jobs still struggle to be given a starttime, however many jobs 
are now being given a starttime following my SchedulerParameters makeover.

I used advice from the mailing list and the Slurm high throughput document to 
help me make changes to the scheduling parameters. They are now...

SchedulerParameters=assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,bf_yield_interval=100,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=200

Also..
PriorityFavorSmall=NO
PriorityFlags=SMALL_RELATIVE_TO_TIME,ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityMaxAge=1-0

The most significant change was actually reducing "PriorityMaxAge" from 7-0 to 
1-0. Before that change the larger jobs could hang around in the queue for 
days. Does it make sense therefore to further reduce PriorityMaxAge to less 
than 1 day? Your advice would be appreciated, please.

Best regards,
David





slurm.conf
Description: slurm.conf


Re: [slurm-users] Effect of PriorityMaxAge on job throughput

2019-04-24 Thread David Baker
Hello Michael,

Thank you for your email and apologies for my tardy response. I'm still sorting 
out my mailbox after an Easter break. I've taken your comments on board and 
I'll see how I go with your suggestions.

Best regards,
David

From: slurm-users  on behalf of Michael 
Gutteridge 
Sent: 16 April 2019 16:43
To: Slurm User Community List
Subject: Re: [slurm-users] Effect of PriorityMaxAge on job throughput

(sorry, kind of fell asleep on you  there...)

I wouldn't expect backfill to be a problem since it shouldn't be starting jobs 
that won't complete before the priority reservations start.  We allow jobs to 
go over (overtimelimit) so in our case it can be a problem.

On one of our cloud clusters we had problems with large jobs getting starved so 
we set "assoc_limit_stop" in the scheduler parameters- I think for your config 
it would require removing "assoc_limit_continue" (we're on Slurm 18 and 
_continue is the default, replaced by _stop if you want that behavior).  
However, there we use the builtin scheduler- I'd imagine this would play heck 
with a fairshare/backfill cluster (like our on-campus) though.  However, it is 
designed to prevent large-job starvation.

We'd also had some issues with fairshare hitting the limit pretty quickly- 
basically it stopped being a useful factor in calculating priority- so we set 
FairShareDampeningFactor to 5 to get a little more utility out of that.

I'd suggest looking at the output of sprio to see how your factors are working 
in situ, particularly when you've got a stuck large job.  It may be that the 
SMALL_RELATIVE_TO_TIME could be washing out the job size factor if your larger 
jobs are also longer.
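
For example (the job ID here is just a placeholder):

sprio -w           # show the configured weights
sprio -l -j 123456 # per-factor breakdown for a specific pending job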

HTH.

M


On Wed, Apr 10, 2019 at 2:46 AM David Baker <d.j.ba...@soton.ac.uk> wrote:
Michael,

Thank you for your reply and your thoughts. These are the priority weights that 
I have configured in the slurm.conf.

PriorityWeightFairshare=100
PriorityWeightAge=10
PriorityWeightPartition=1000
PriorityWeightJobSize=1000
PriorityWeightQOS=1

I've made the PWJobSize to be the highest factor, however I understand that 
that only provides a once-off kick to jobs and so it probably insignificant in 
the longer run . That's followed by the PWFairshare.

Should I really be looking at increasing the PWAge factor to help to "push 
jobs" through the system?

The other issue that might play a part is that we see a lot of single-node jobs 
(presumably backfilled) entering the system. Users aren't excessively bombing the 
cluster, but maybe some backfill throttling would be useful as well (?)

What are your thoughts having seen the priority factors, please? I've attached 
a copy of the slurm.conf just in case you or anyone else wants to take a more 
complete overview.

Best regards,
David


From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Michael 
Gutteridge <michael.gutteri...@gmail.com>
Sent: 09 April 2019 18:59
To: Slurm User Community List
Subject: Re: [slurm-users] Effect of PriorityMaxAge on job throughput


It might be useful to include the various priority factors you've got 
configured.  The fact that adjusting PriorityMaxAge had a dramatic effect 
suggests that the age factor is pretty high- might be worth looking at that 
value relative to the other factors.

Have you looked at PriorityWeightJobSize?  Might have some utility if you're 
finding large jobs getting short-shrift.

 - Michael


On Tue, Apr 9, 2019 at 2:01 AM David Baker <d.j.ba...@soton.ac.uk> wrote:
Hello,

I've finally got the job throughput/turnaround to be reasonable in our cluster. 
Most of the time the job activity on the cluster sets the default QOS to 32 
nodes (there are 464 nodes in the default queue). Jobs requesting nodes close 
to the QOS level (for example 22 nodes) are scheduled within 24 hours which is 
better than it has been. Still I suspect there is room for improvement. I note 
that these large jobs still struggle to be given a starttime, however many jobs 
are now being given a starttime following my SchedulerParameters makeover.

I used advice from the mailing list and the Slurm high throughput document to 
help me make changes to the scheduling parameters. They are now...

SchedulerParameters=assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,bf_yield_interval=100,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=200

Also..
PriorityFavorSmall=NO
PriorityFlags=SMALL_RELATIVE_TO_TIME,ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityMaxAge=1-0

The most significant change was actually reducing "PriorityMaxAge" from 7-0 to 
1-0. Before that change the lar

[slurm-users] Slurm database failure messages

2019-05-07 Thread David Baker
Hello,

We are experiencing quite a number of database failures. We saw an outright 
failure a short while ago where we had to restart the MariaDB database and the 
slurmdbd process. After the restart the database appeared to be working well, 
however over the last few days I have noticed quite a number of failures. For 
example -- see below. Does anyone understand what might be going wrong, why, and 
whether we should be concerned, please? I understand that slurm databases can 
get quite large relatively quickly and so I wonder if this is memory related.

Best regards,
David

[root@blue51 slurm]# less slurmdbd.log-20190506.gz | grep failed
[2019-05-05T04:00:05.603] error: mysql_query failed: 1213 Deadlock found when 
trying to get lock; try restarting transaction
[2019-05-05T04:00:05.606] error: Cluster i5 rollup failed
[2019-05-05T23:00:07.017] error: mysql_query failed: 1213 Deadlock found when 
trying to get lock; try restarting transaction
[2019-05-05T23:00:07.018] error: Cluster i5 rollup failed
[2019-05-06T00:00:13.348] error: mysql_query failed: 1213 Deadlock found when 
trying to get lock; try restarting transaction
[2019-05-06T00:00:13.350] error: Cluster i5 rollup failed


[slurm-users] Partition QOS limits not being applied

2019-05-09 Thread David Carlson
Hi SLURM users,

I work on a cluster, and we recently transitioned to using SLURM on some of
our nodes.  However, we're currently having some difficulty limiting the
number of jobs that a user can run simultaneously in particular
partitions.  Here are the steps we've taken:

1.  Created a new QOS and set MaxJobsPerUser=4 with sacctmgr.
2.  Modified slurm.conf so that the relevant PartitionName line includes
QOS=.
3.  Restarted slurmctld.

However, after taking these steps, the partition in question still does not
have any limits on the number of jobs that a user can run simultaneously.
Is there something wrong here, or are there additional steps that we need
to take?

Any advice is greatly appreciated!
Thanks,
Dave

-- 
Dave Carlson
PhD Candidate
Ecology and Evolution Department
Stony Brook University


[slurm-users] Testing/evaluating new versions of slurm (19.05 in this case)

2019-05-16 Thread David Baker
Hello,

Following the various postings regarding slurm 19.05 I thought it was an 
opportune time to send this question to the forum.

Like others I'm awaiting 19.05 primarily due to the addition of the XFACTOR 
priority setting, but due to other new/improved features as well. I'm 
interested to hear how other admins/groups test (and stress) new versions of 
slurm. That is, how do admins test a new version with (a) a realistic workload 
and (b) sufficient hardware resources, without taking too many hardware 
resources from their production cluster and/or annoying too many users? I 
understand that it is possible to emulate a large cluster on SMP nodes by 
firing up many slurm processes on those nodes, for example.

I have been experimenting with a slurm simulator 
(https://github.com/ubccr-slurm-simulator/slurm_sim_tools/blob/master/doc/slurm_sim_manual.Rmd)
 using historical job data, however that simulator is based on an old version 
of slurm and (to be honest) it's slightly unreliable for serious study. It's 
certainly only useful for broad brush analysis, at the most.

Please let me have your thoughts -- they would be appreciated.

Best regards,
David



[slurm-users] Updating slurm priority flags

2019-05-18 Thread david baker
Hello,

I have a quick question regarding updating the priority flags in the
slurm.conf file. Currently I have the flag "small_relative_to_time" set.
I'm finding that that flag is washing out the job size priority weight
factor and I would like to experiment without it.

So when you remove that priority flag from the configuration should slurm
automatically update the job size priority weight factor for the existing
jobs? I am concerned that existing jobs will not have their priority
changed. Does anyone know how to make this sort of change without adversely
affecting the "dynamics" of existing and new jobs in the cluster? That is,
I don't want existing jobs to lose out to new jobs in terms of overall priority.

Your advice would be appreciated, please.

Best regards,
David


[slurm-users] Advice on setting up fairshare

2019-06-06 Thread David Baker
Hello,


Could someone please give me some advice on setting up the fairshare in a 
cluster. I don't think the present setup is wildly incorrect, however either my 
understanding of the setup is wrong or something is misconfigured.


When we set a new user up on the cluster and they haven't used any resources am 
I correct in thinking that their fairshare (as reported by sshare -a) should be 
1.0? Looking at a new user,  I see...


[root@blue52 slurm]# sshare -a | grep rk1n15
  soton  rk1n15  10.003135   0  
0.00   0.822165


This is a very simple setup. We have a number of groups (all under root)...


soton -- general public

hydrology - specific groups that have purchased their own nodes.

relgroup

worldpop


What I do for each of these groups, when a new user is added, is increment the 
number of shares per the relevant group using, for example...


sacctmgr modify account soton set fairshare=X


Where X is the number of users in the group (soton in this case).


The sshare -a command would give me a global overview...


 Account   User  RawShares  NormSharesRawUsage  
EffectvUsage  FairShare
 -- -- --- --- 
- --
root  0.00 15431286261  1.00
 root  root  10.002755  40  
0.00   1.00
 hydrology   30.008264 1357382  0.88
  hydrology  da1g18  10.33   0  
0.00   0.876289



Does that all make sense or am I missing something? I am, by the way, using the 
line

PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE in my slurm.conf.


Best regards,

David


Re: [slurm-users] Advice on setting up fairshare

2019-06-07 Thread David Baker
Hi Loris,


Thank you for your reply. I had started to read about 'Fairshare=parent' and 
wondered if that was the way to go. So that all makes sense. I set 
'fairshare=parent' at the account levels and that does the job very well. 
Things are looking much better and now new (and eternally idle) users receive a 
 fairshare of 1 as expected. It certainly makes the scripts/admin a great deal 
less cumbersome.


Best regards,

David


From: slurm-users  on behalf of Loris 
Bennett 
Sent: 07 June 2019 07:11:36
To: Slurm User Community List
Subject: Re: [slurm-users] Advice on setting up fairshare

Hi David,

I haven't had time to look into your current problem, but inline I have
some comments about the general approach.

David Baker  writes:

> Hello,
>
> Could someone please give me some advice on setting up the fairshare
> in a cluster. I don't think the present setup is wildly incorrect,
> however either my understanding of the setup is wrong or something is
> reconfigured.
>
> When we set a new user up on the cluster and they haven't used any
> resources am I correct in thinking that their fairshare (as reported
> by sshare -a) should be 1.0? Looking at a new user, I see...
>
> [root@blue52 slurm]# sshare -a | grep rk1n15
> soton rk1n15 1 0.003135 0 0.00 0.822165
>
> This is a very simple setup. We have a number of groups (all under
> root)...
>
> soton -- general public
>
> hydrology - specific groups that have purchased their own nodes.
>
> relgroup
>
> worldpop
>
> What I do for each of these groups, when a new user is added, is
> increment the number of shares per the relevant group using, for
> example...
>
> sacctmgr modify account soton set fairshare=X
>
> Where X is the number of users in the group (soton in this case).

I did this for years and wrote additional logic to automatically
increment/decrement shares when users were added/deleted/moved, but
recently realised that for our use-case it is not necessary.

The way shares seem to be intended to work is that some project gets
a fixed allocation on the system, or some group buys a certain number of
nodes for the cluster.  Shares are then dished out based on the size of
the project or the number of nodes and are thus fairly static.

You seem to have more of a setup like we do: a centrally financed system
which is free to use and where everyone is treated equally.  What we now
do is have the Fairshare parameter for all accounts in the hierarchy set
to "Parent".  This means that everyone ends up with one normalised share
and no changes have to be propagated through the hierarchy.
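
If it helps, with the account names from your mail that is just a one-liner per 
account in the hierarchy, e.g.

sacctmgr modify account soton set fairshare=parent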

We also moved creation of the Slurm association into the submit plugin, so
that if someone applies for access but never logs in, we can remove
them from the system after four weeks without having to clean up in
Slurm as well.

Maybe this kind of approach might work for you, too.

Cheers,

Loris

> The sshare -a command would give me a global overview...
>
> Account User RawShares NormShares RawUsage EffectvUsage FairShare
>  -- -- --- --- 
> - --
> root 0.00 15431286261 1.00
> root root 1 0.002755 40 0.00 1.00
> hydrology 3 0.008264 1357382 0.88
> hydrology da1g18 1 0.33 0 0.00 0.876289
> 
>
> Does that all make sense or am I missing something? I am, by the way,
> using the line
>
> PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE in my slurm.conf.
>
> Best regards,
>
> David
>
>
--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



[slurm-users] Deadlocks in slurmdbd logs

2019-06-19 Thread David Baker
Hello,


Every day we see several deadlocks in our slurmdbd log file. Together with the 
deadlock we always see a failed "roll up" operation. Please see below for an 
example.


We are running slurm 18.08.0 on our cluster. As far as we know these deadlocks 
are not adversely affecting the operation of the cluster. Each day jobs are 
"rolling" through the cluster and the utilisation of the cluster is constantly 
high. Furthermore, it doesn't appear that we are losing data in the database. 
I'm not a database expert and so I have no idea where to start with this. Our 
local db experts have taken a look and are nonplussed.


I wondered if anyone in the community had any ideas please. As an aside I've 
just started to experiment with v19* and it would be nice to think that these 
deadlocks will just go away in due course (following an eventual upgrade when 
that version is a bit more mature), however that may not be the case.


Best regards,

David


[2019-06-19T00:00:02.728] error: mysql_query failed: 1213 Deadlock found when 
trying to get lock; try restarting transaction
insert into "i5_assoc_usage_hour_table"
.


[2019-06-19T00:00:02.729] error: Couldn't add assoc hour rollup
[2019-06-19T00:00:02.729] error: Cluster i5 rollup failed



[slurm-users] Requirement to run longer jobs

2019-07-03 Thread David Baker
Hello,


A few of our users have asked about running longer jobs on our cluster. 
Currently our main/default compute partition has a time limit of 2.5 days. 
Potentially, a handful of users need jobs to run for up to 5 days. Rather than 
allow all users/jobs to have a run time limit of 5 days, I wondered if the 
following scheme makes sense...


Increase the max run time on the default partition to be 5 days, however limit 
most users to a max of 2.5 days using the default "normal" QOS.


Create a QOS called "long" with a max time limit of 5 days. Limit the users who 
can use "long". For authorized users, assign the "long" QOS to their jobs on the 
basis of their run time request.
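
To make that concrete, I imagine the QOS side would look roughly like this (the 
exact limits and the user name are only illustrative):

sacctmgr modify qos normal set MaxWall=2-12:00:00
sacctmgr add qos long
sacctmgr modify qos long set MaxWall=5-00:00:00
sacctmgr modify user someuser set qos+=long

together with MaxTime=5-00:00:00 on the default partition line in slurm.conf, and 
authorized users adding --qos=long to their submissions.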


Does the above make sense or is it too complicated? If the above works could 
users limited to using the normal QOS have their running jobs run time 
increased to 5 days in exceptional circumstances?


I would be interested in your thoughts, please.


Best regards,

David


Re: [slurm-users] Requirement to run longer jobs

2019-07-05 Thread David Baker
Hello,


Thank you to everyone who replied to my email. I'll need to experiment and see 
how I get on.


Best regards,

David



From: slurm-users  on behalf of Loris 
Bennett 
Sent: 04 July 2019 06:53
To: Slurm User Community List
Subject: Re: [slurm-users] Requirement to run longer jobs

Hi Chris,

Chris Samuel  writes:

> On 3/7/19 8:49 am, David Baker wrote:
>
>> Does the above make sense or is it too complicated?
>
> [looks at our 14 partitions and 112 QOS's]
>
> Nope, that seems pretty simple.  We do much the same here.

Out of interest, how many partitions and QOSs would an average user
actually ever use?

I'm coming from a very simple set-up which originally had just 3
partitions and 3 QOSs.  We have now gone up to 6 partitions and I'm
already worrying that it's getting too complicated 😅

Cheers,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] Invalid qos specification

2019-07-15 Thread David Rhey
I ran into this recently. You need to make sure your user account has
access to that QoS through sacctmgr. Right now I'd say if you did sacctmgr
show user  withassoc that the QoS you're attempting to use is NOT
listed as part of the association.
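
If the QoS does turn out to be missing, adding it to the association is usually 
enough (the user name is a placeholder):

sacctmgr show assoc user=someuser format=user,account,partition,qos
sacctmgr modify user someuser set qos+=debug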

On Mon, Jul 15, 2019 at 2:53 PM Prentice Bisbal  wrote:

> Slurm users,
>
> I have created a partition named general should allow the QOSes
> 'general' and 'debug':
>
> PartitionName=general Default=YES AllowQOS=general,debug Nodes=.
>
> However, when I try to request that QOS, I get an error:
>
> $ salloc -p general -q debug  -t 00:30:00
> salloc: error: Job submit/allocate failed: Invalid qos specification
>
> I'm sure I'm overlooking  something obvious. Any idea what that may be?
> I'm using slurm 18.08.8 on the slurm controller, and the clients are
> still at 18.08.7 until tomorrow morning.
>
> --
> Prentice
>
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] Cluster-wide GPU Per User limit

2019-07-17 Thread David Rhey
Unfortunately, I think you're stuck setting it at the account level with
sacctmgr. You could also set that limit as part of a QoS and then attach
the QoS to the partition. But I think that's as granular as you can get for
limiting TRES.
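
Roughly, the QoS route would be something like the following (names are 
placeholders), with QOS=gpulimit then added to the relevant PartitionName line 
in slurm.conf:

sacctmgr add qos gpulimit
sacctmgr modify qos gpulimit set MaxTRESPerUser=gres/gpu=2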

HTH!

David

On Wed, Jul 17, 2019 at 10:11 AM Mike Harvey  wrote:

>
> Is it possible to set a cluster level limit of GPUs per user? We'd like
> to implement a limit of how many GPUs a user may use across multiple
> partitions at one time.
>
> I tried this, but it obviously isn't correct:
>
> # sacctmgr modify cluster slurm_cluster set MaxTRESPerUser=gres/gpu=2
>   Unknown option: MaxTRESPerUser=gres/gpu=2
>   Use keyword 'where' to modify condition
>
>
> Thanks!
>
> --
> Mike Harvey
> Systems Administrator
> Engineering Computing
> Bucknell University
> har...@bucknell.edu
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] No error/output/run

2019-07-24 Thread David Rhey
From your email it looks like you submitted the job, ran squeue and saw
that it either didn't start or completed very quickly. I'd start with the
job ExitCode info from sacct.
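
For example, using the job id from your output below:

sacct -j 1277 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,NodeList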

On Wed, Jul 24, 2019 at 4:34 AM Mahmood Naderan 
wrote:

> Hi,
> I don't know why no error/output file is generated after the job
> submission.
>
> $ ls -l
> total 8
> -rw-r--r-- 1 montazeri montazeri 472 Jul 24 12:52 in.lj
> -rw-rw-r-- 1 montazeri montazeri 254 Jul 24 12:53 slurm_script.sh
> $ cat slurm_script.sh
> #!/bin/bash
> #SBATCH --job-name=my_lammps
> #SBATCH --output=out.lj
> #SBATCH --partition=EMERALD
> #SBATCH --account=z55
> #SBATCH --mem=4GB
> #SBATCH --nodes=4
> #SBATCH --ntasks-per-node=3
> mpirun -np 12 /share/apps/softwares/lammps-12Dec18/src/lmp_mpi -in in.lj
>
> $ sbatch slurm_script.sh
> Submitted batch job 1277
> $ squeue
>  JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
> $ ls
> in.lj  slurm_script.sh
> $
>
>
> What does that mean?
>
> Regards,
> Mahmood
>
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


[slurm-users] Slurm node weights

2019-07-25 Thread David Baker
Hello,


I'm experimenting with node weights and I'm very puzzled by what I see. Looking 
at the documentation I gathered that jobs will be allocated to the nodes with 
the lowest weight which satisfies their requirements. I have 3 nodes in a 
partition and I have defined the nodes like so..


NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN


So, given that the default weight is 1 I would expect jobs to be allocated to 
orange02 and orange03 first. I find, however that my test job is always 
allocated to orange01 with the higher weight. Have I overlooked something? I 
would appreciate your advice, please.




Re: [slurm-users] Slurm node weights

2019-07-25 Thread David Baker
Hello,


As an update I note that I have tried restarting the slurmctld, however that 
doesn't help.


Best regards,

David


From: slurm-users  on behalf of David 
Baker 
Sent: 25 July 2019 11:47:35
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Slurm node weights


Hello,


I'm experimenting with node weights and I'm very puzzled by what I see. Looking 
at the documentation I gathered that jobs will be allocated to the nodes with 
the lowest weight which satisfies their requirements. I have 3 nodes in a 
partition and I have defined the nodes like so..


NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN


So, given that the default weight is 1 I would expect jobs to be allocated to 
orange02 and orange03 first. I find, however that my test job is always 
allocated to orange01 with the higher weight. Have I overlooked something? I 
would appreciate your advice, please.




Re: [slurm-users] Slurm node weights

2019-07-25 Thread David Baker
Hello,


Thank you for the replies. We're running an early version of Slurm 18.08 and it 
does appear that the node weights are being ignored because of the bug.


We're experimenting with Slurm 19*, however we don't expect to deploy that new 
version for quite a while. In the meantime does anyone know if there is any fix or 
alternative strategy that might help us to achieve the same result?


Best regards,

David


From: slurm-users  on behalf of Sarlo, 
Jeffrey S 
Sent: 25 July 2019 12:26
To: Slurm User Community List 
Subject: Re: [slurm-users] Slurm node weights


Which version of Slurm are you running?  I know some of the earlier versions of 
18.08 had a bug and node weights were not working.


Jeff



From: slurm-users  on behalf of David 
Baker 
Sent: Thursday, July 25, 2019 6:09 AM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Slurm node weights


Hello,


As an update I note that I have tried restarting the slurmctld, however that 
doesn't help.


Best regards,

David


From: slurm-users  on behalf of David 
Baker 
Sent: 25 July 2019 11:47:35
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Slurm node weights


Hello,


I'm experimenting with node weights and I'm very puzzled by what I see. Looking 
at the documentation I gathered that jobs will be allocated to the nodes with 
the lowest weight which satisfies their requirements. I have 3 nodes in a 
partition and I have defined the nodes like so..


NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN


So, given that the default weight is 1 I would expect jobs to be allocated to 
orange02 and orange03 first. I find, however that my test job is always 
allocated to orange01 with the higher weight. Have I overlooked something? I 
would appreciate your advice, please.




Re: [slurm-users] Slurm node weights

2019-07-25 Thread David Baker
Hi Jeff,


Thank you for these details. So far we have never implemented any Slurm fixes. 
I suspect the node weights feature is quite important and useful, and it's 
probably worth me investigating this fix. In this respect could you please 
advise me?


If I use the fix to regenerate the "slurm-slurmd" rpm can I then stop the 
slurmctld processes on the servers, re-install the revised rpm and finally 
restart the slurmctld processes? Most importantly, can this replacement/fix be 
done on a live system that is running jobs, etc? That's assuming that we 
regard/announce the system to be at risk. Or alternatively, do we need to 
arrange downtime, etc?


Best regards,

David





From: slurm-users  on behalf of Sarlo, 
Jeffrey S 
Sent: 25 July 2019 13:04
To: Slurm User Community List 
Subject: Re: [slurm-users] Slurm node weights


This is the fix if you want to modify the code and rebuild


https://github.com/SchedMD/slurm/commit/f66a2a3e2064

I think 18.08.04 and later have it fixed.

Jeff


____
From: slurm-users  on behalf of David 
Baker 
Sent: Thursday, July 25, 2019 6:53 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Slurm node weights


Hello,


Thank you for the replies. We're running an early version of Slurm 18.08 and it 
does appear that the node weights are being ignored re the bug.


We're experimenting with Slurm 19*, however we don't expect to deploy that new 
version for quite a while. In the meantime does anyone know if there any fix or 
alternative strategy that might help us to achieve the same result?


Best regards,

David


From: slurm-users  on behalf of Sarlo, 
Jeffrey S 
Sent: 25 July 2019 12:26
To: Slurm User Community List 
Subject: Re: [slurm-users] Slurm node weights


Which version of Slurm are you running?  I know some of the earlier versions of 
18.08 had a bug and node weights were not working.


Jeff


____
From: slurm-users  on behalf of David 
Baker 
Sent: Thursday, July 25, 2019 6:09 AM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Slurm node weights


Hello,


As an update I note that I have tried restarting the slurmctld, however that 
doesn't help.


Best regards,

David

____
From: slurm-users  on behalf of David 
Baker 
Sent: 25 July 2019 11:47:35
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Slurm node weights


Hello,


I'm experimenting with node weights and I'm very puzzled by what I see. Looking 
at the documentation I gathered that jobs will be allocated to the nodes with 
the lowest weight which satisfies their requirements. I have 3 nodes in a 
partition and I have defined the nodes like so..


NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN Weight=50
NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 
RealMemory=1018990 State=UNKNOWN


So, given that the default weight is 1 I would expect jobs to be allocated to 
orange02 and orange03 first. I find, however that my test job is always 
allocated to orange01 with the higher weight. Have I overlooked something? I 
would appreciate your advice, please.




[slurm-users] Slurm statesave directory -- location and management

2019-08-28 Thread David Baker
Hello,

I apologise that this email is a bit vague, however we are keen to understand 
the role of the Slurm "StateSave" location. I can see the value of the 
information in this location when, for example, we are upgrading Slurm and the 
database is temporarily down, however as I note above we are keen to gain a 
much better understanding of this directory.

We have two Slurm controller nodes (one of them is a backup controller), and 
currently we have put the "StateSave" directory on one of the global GPFS file 
stores. In other respects Slurm operates independently of the GPFS file stores 
-- apart from the fact that if GPFS fails jobs will subsequently fail. There 
was a GPFS failure when I was away from the university. Once GPFS had been 
restored they attempted to start Slurm, however the StateSave data was out of 
date. They eventually restarted Slurm, however lost all the queued jobs and the 
job sequence counter restarted at one.

Am I correct in thinking that the information in the StateSave location relates 
to the state of (a) jobs currently running on the cluster and (b) jobs queued? 
Am I also correct in thinking that this information is not stored in the slurm 
database? In other words if you lose the statesave data or it gets corrupted 
then you will lose all running/queued jobs?

Any advice on the management and location of the statesave directory in a dual 
controller system would be appreciated, please.
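
For reference, the setting in question is StateSaveLocation in slurm.conf, which 
with a backup controller has to be readable and writable by both controllers; 
the path below is only an example:

StateSaveLocation=/gpfs/slurm/statesave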

Best regards,
David


[slurm-users] oddity with users showing in sacctmgr and sreport

2019-09-12 Thread David Rhey
Hello,

First issue: I have a couple dozen users that show up for an account but
outside of the hierarchical structure in sacctmgr:

sacctmgr show assoc account=
format=Account,User,Cluster,ParentName%20 Tree

When I execute that on a given account, I see that one user resides outside the
account hierarchy even though the account specifies its Parent Name (i.e. the user is part
of the account, just not tucked correctly into the hierarchy):

 Account   User Cluster Par Name
 --  -- 
acct rogueuser mycluster
acct   mycluster acct_root
 acctnormaluser  mycluster


Second issue: A couple of rogue users from within sacctmgr show up in
sreport output ABOVE the account usage:

sreport -T billing cluster AccountUtilizationByUser Accounts=
Start=2019-08-01 End=2019-09-11

returns the output, but the first line in the output given is that of a
USER and not the specified account as noted in the man sreport section:

   cluster AccountUtilizationByUser
  This report will display account utilization as it appears on the
  hierarchical tree. Starting with the specified account, or the root
  account by default, this report will list the underlying usage with a
  sum on each level. Use the 'tree' option to span the tree for better
  visibility. NOTE: If there were reservations allowing a whole account,
  any idle time in the reservation is given to the association for the
  account, not the user associations in the account, so it can be
  possible for a parent account to be larger than the sum of its children.

Has anyone seen either of these behaviors? I've even queried the DB just to
see if there wasn't something more obvious as to the issue, but I can't
find anything. The associations are very tidy in the DB. When I dump the
cluster using sacctmgr I can see the handful of rogue associations all the
way at the top of the list, meaning they aren't a part of the root
hierarchy in sacctmgr.

We're using 18.08.7.

Thanks!

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] Maxjobs not being enforced

2019-09-17 Thread David Rhey
Hi, Tina,

Could you send the command you ran?

David

On Tue, Sep 17, 2019 at 2:06 PM Tina Fora  wrote:

> Hello Slurm user,
>
> We have 'AccountingStorageEnforce=limits,qos' set in our slurm.conf. I've
> added maxjobs=100 for a particular user causing havoc on our shared
> storage. This setting is still not being enforced and the user is able to
> launch 1000s of jobs.
>
> I also ran 'scontrol reconfig' and even restarted slurmd on the computes
> but no luck. I'm on 17.11. Are there additional steps to limit a user?
>
> Best,
> T
>
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] Maxjobs not being enforced

2019-09-18 Thread David Rhey
Hi, Tina,

Are you able to confirm whether or not you can view the limit for the user
in scontrol as well?

David

On Tue, Sep 17, 2019 at 4:42 PM Tina Fora  wrote:

>
> # sacctmgr modify user lif6 set maxjobs=100
>
> # sacctmgr list assoc user=lif6 format=user,maxjobs,maxsubmit,maxtresmins
> User MaxJobs MaxSubmit   MaxTRESMins
> -- --- - -
>   lif6  100
>
>
>
> > Hi, Tina,
> >
> > Could you send the command you ran?
> >
> > David
> >
> > On Tue, Sep 17, 2019 at 2:06 PM Tina Fora  wrote:
> >
> >> Hello Slurm user,
> >>
> >> We have 'AccountingStorageEnforce=limits,qos' set in our slurm.conf.
> >> I've
> >> added maxjobs=100 for a particular user causing havoc on our shared
> >> storage. This setting is still not being enforced and the user is able
> >> to
> >> launch 1000s of jobs.
> >>
> >> I also ran 'scontrol reconfig' and even restarted slurmd on the computes
> >> but no luck. I'm on 17.11. Are there additional steps to limit a user?
> >>
> >> Best,
> >> T
> >>
> >>
> >>
> >
> > --
> > David Rhey
> > ---
> > Advanced Research Computing - Technology Services
> > University of Michigan
> >
>
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


[slurm-users] Advice on setting a partition QOS

2019-09-25 Thread David Baker
Hello,

I have defined a partition and corresponding QOS in Slurm. This is the serial 
queue to which we route jobs that require up to (and including) 20 cpus. The 
nodes controlled by serial are shared. I've set the QOS like so..

[djb1@cyan53 slurm]$ sacctmgr show qos serial format=name,maxtresperuser
  Name MaxTRESPU
-- -
serial   cpu=120

The max cpus/user is set high to try to ensure (as often as possible) that the 
nodes are all busy and not in mixed states. Obviously this cannot be the case 
all the time -- depending upon memory requirements, etc.

I noticed that a number of jobs were pending with the reason 
QOSMaxNodePerUserLimit. I've tried firing test jobs to the queue myself and 
noticed that I can never have more than 32 jobs running (each requesting 1 cpu) 
and the rest are pending as per the reason above. Since the QOS cpu/user limit 
is set to 120 I would expect to be able to run more jobs -- given that some 
serial nodes are still not fully occupied. Furthermore, I note that other users 
appear not to be able to use more then 32 cpus in the queue.

The 32 limit does make a degree of sense. The "normal" QOS is set to 
cpus/user=1280, nodes/user=32. It's almost like the 32 cpus in the serial queue 
are being counted as nodes -- as per the pending reason.

Could someone please help me understand this issue and how to avoid it?

Best regards,
David


Re: [slurm-users] Advice on setting a partition QOS

2019-09-25 Thread David Baker
Dear Jurgen,

Thank you for your reply. So, in response to your suggestion I submitted a batch 
of jobs each asking for 2 cpus. Again I was able to get 32 jobs running at 
once. I presume this is a weird interaction with the normal QOS. In that 
respect, would it be best to redefine the normal QOS simply in terms of cpu/user 
usage? That is, not cpus/user and nodes/user.

Best regards,
David


From: slurm-users  on behalf of Juergen 
Salk 
Sent: 25 September 2019 14:52
To: Slurm User Community List 
Subject: Re: [slurm-users] Advice on setting a partition QOS

Dear David,

as it seems, Slurm counts allocated nodes on a per-job basis,
i.e. every individual one-core job counts as an additional node
even if they all run on one and the same node.

Can you allocate 64 CPUs at the same time when requesting 2 CPUs
per job?

We've also had this (somewhat strange) behaviour with Moab and
therefore implemented limits based on processor counts rather
than node counts per user. This is obviously no issue for exclusive
node scheduling, but for non-exclusive nodes it is (or at least may
be).

Best regards
Jürgen

--
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471




* David Baker  [190925 12:12]:
> Hello,
>
> I have defined a partition and corresponding QOS in Slurm. This is
> the serial queue to which we route jobs that require up to (and
> including) 20 cpus. The nodes controlled by serial are shared. I've
> set the QOS like so..
>
> [djb1@cyan53 slurm]$ sacctmgr show qos serial format=name,maxtresperuser
>   Name MaxTRESPU
> -- -
> serial   cpu=120
>
> The max cpus/user is set high to try to ensure (as often as
> possible) that the nodes are all busy and not in mixed states.
> Obviously this cannot be the case all the time -- depending upon
> memory requirements, etc.
>
> I noticed that a number of jobs were pending with the reason
> QOSMaxNodePerUserLimit. I've tried firing test jobs to the queue
> myself and noticed that I can never have more than 32 jobs running
> (each requesting 1 cpu) and the rest are pending as per the reason
> above. Since the QOS cpu/user limit is set to 120 I would expect to
> be able to run more jobs -- given that some serial nodes are still
> not fully occupied. Furthermore, I note that other users appear not
> to be able to use more then 32 cpus in the queue.
>
> The 32 limit does make a degree of sense. The "normal" QOS is set to
> cpus/user=1280, nodes/user=32. It's almost like the 32 cpus in the
> serial queue are being counted as nodes -- as per the pending
> reason.
>
> Could someone please help me understand this issue and how to avoid it?
>
> Best regards,
> David




[slurm-users] How to modify the normal QOS

2019-09-26 Thread David Baker
Hello,

Currently my normal QOS specifies MaxTRESPU=cpu=1280,nodes=32. I've tried a 
number of edits, however I haven't yet found a way of redefining the MaxTRESPU 
to be "cpu=1280". In the past I have resorted to deleting a QOS completely and 
redefining the whole thing, but in this case I'm not sure if I can delete the 
normal QOS on a running cluster.

I have tried commands like the following to no avail..

sacctmgr update qos normal set maxtresperuser=cpu=1280

Could anyone please help with this.

Best regards,
David


Re: [slurm-users] How to modify the normal QOS

2019-09-26 Thread David Baker
Dear Jurgen,

Thank you for that. That does the expected job. It looks like the weirdness 
that I saw in the serial partition has now gone away and so that is good.

Best regards,
David

From: slurm-users  on behalf of Juergen 
Salk 
Sent: 26 September 2019 16:18
To: Slurm User Community List 
Subject: Re: [slurm-users] How to modify the normal QOS

* David Baker  [190926 14:12]:
>
> Currently my normal QOS specifies MaxTRESPU=cpu=1280,nodes=32. I've
> tried a number of edits, however I haven't yet found a way of
> redefining the MaxTRESPU to be "cpu=1280". In the past I have
> resorted to deleting a QOS completely and redefining the whole
> thing, but in this case I'm not sure if I can delete the normal QOS
> on a running cluster.
>
> I have tried commands like the following to no avail..
>
> sacctmgr update qos normal set maxtresperuser=cpu=1280
>
> Could anyone please help with this.

Dear David,

does this work for you?

 sacctmgr update qos normal set MaxTRESPerUser=node=-1

Best regards
Jürgen

--
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471



[slurm-users] Slurm very rarely assigned an estimated start time to a job

2019-10-02 Thread David Baker
Hello,

I have just started to take a look at Slurm v19* with a view to an upgrade 
(most likely in the next year). My reaction is that Slurm very rarely provides 
an estimated start time for a job. I understand that this is not possible for 
jobs on hold and dependent jobs. On the other hand I've just submitted a set of 
simple test jobs to our development cluster, and I see that none of the queued 
jobs have an estimated start time. These jobs haven't been queuing for too long 
(30 minutes or so). That is...

[root@headnode-1 slurm]# squeue --start
 JOBID PARTITION NAME USER ST  START_TIME  NODES 
SCHEDNODES   NODELIST(REASON)
12 batchmyjob  hpcdev1 PD N/A  1 
(null)   (Resources)
13 batchmyjob  hpcdev1 PD N/A  1 
(null)   (Resources)
14 batchmyjob  hpcdev1 PD N/A  1 
(null)   (Resources)
15 batchmyjob  hpcdev1 PD N/A  1 
(null)   (Resources)
etc

Is this what others see or are there any recommended configurations/tips/tricks 
to make sure that slurm provides estimates? Any advice would be appreciated, 
please.

Best regards,
David


Re: [slurm-users] Does Slurm store "time in current state" values anywhere ?

2019-10-03 Thread David Rhey
Hi,

What about scontrol show job  to see various things like:

SubmitTime, EligibleTime, AccrueTime etc?

David

On Thu, Oct 3, 2019 at 4:53 AM Kevin Buckley 
wrote:

> Hi there,
>
> we're hoping to overcome an issue where some of our users are keen
> on writing their own meta-schedulers, so as to try and beat the
> actual scheduler, but can't seemingly do as good a job as a scheduler
> that's been developed by people who understand scheduling (no real
> surprises there!), and so occasionally generate false perceptions of
> our systems.
>
> One of the things our meta-scheduler writers seem unable to allow for,
> is jobs that remain in a "completing" state, for whatever reason.
>
> Whilst we're not looking to provide succour to meta-scheduler writers,
> we can see a need for some way to present and/or make use of, a
>
>"job has been in state S for time T"
>
> or
>
>"job entered current state at time T"
>
> info.
>
>
> Can we access such a value from Slurm: rather, does Slurm keep track
> of such a value, whether or not it can currently be accessed on the
> "user-side" ?
>
>
> What we're trying to avoid is the need to write a not-quite-Slurm
> database that stores such info by continually polling our actual
> Slurm database, because we don't think of ourselves as meta-scheduler
> writers.
>
> Here's hoping,
> Kevin
>
> --
> Supercomputing Systems Administrator
> Pawsey Supercomputing Centre
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] Slurm very rarely assigned an estimated start time to a job

2019-10-03 Thread David Rhey
We've been working to tune our backfill scheduler here. Here is a
presentation some of you might have seen at a previous SLUG on tuning the
backfill scheduler. HTH!

https://slurm.schedmd.com/SUG14/sched_tutorial.pdf

David

On Wed, Oct 2, 2019 at 1:37 PM Mark Hahn  wrote:

> >(most likely in the next year). My reaction is that Slurm very rarely
> >provides an estimated start time for a job. I understand that this is not
> >possible for jobs on hold and dependent jobs.
>
> it's also not possible if both running and queued jobs
> lack definite termination times; do yours?
>
> my understanding is the following:
> the main scheduler does not perform forward planning.
> that is, it is opportunistic.  it walks the list of priority-sorted
> pending jobs, starting any which can run on currently free
> (or preemptable) resources.
>
> the backfill scheduler is a secondary, asynchronous loop that tries hard
> not to interfere with the main scheduler (severely throttles itself)
> and tries to place start times for pending jobs.
>
> the main issue with forward scheduling is that if high-prio jobs become
> runnable (submitted, off hold, dependency-satisfied), then most of the
> (tentative) start times probably need to be removed.
>
> a quick look at plugins/sched/backfill/backfill.c indicates that things
> are /complicated/ ;)
>
> we (ComputeCanada) don't see a lot of forward start times either.
>
> I also would welcome discussion of how to tune the backfill scheduler!
> I suspect that in order to work well, it needs a particular distribution
> of job priorities.
>
> regards, mark hahn.
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


[slurm-users] Running job using our serial queue

2019-11-04 Thread David Baker
Hello,

We decided to route all jobs requesting from 1 to 20 cores to our serial queue. 
Furthermore, the nodes controlled by the serial queue are shared by multiple 
users. We did this to try to reduce the level of fragmentation across the 
cluster -- our default "batch" queue provides exclusive access to compute nodes.

It looks like the downside of the serial queue is that jobs from different 
users can interact quite badly. To some extent this is an education issue -- 
for example matlab users need to be told to add the "-singleCompThread" option 
to their command line. On the other hand I wonder if our cgroups setup is 
optimal for the serial queue. Our cgroup.conf contains...

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup

The relevant cgroup configuration in the slurm.conf is...
ProctrackType=proctrack/cgroup
TaskPlugin=affinity,cgroup

Could someone please advise us on the required/recommended cgroup setup for the 
above scenario? For example, should we really set "TaskAffinity=yes"? I assume 
the interaction between jobs (sometimes jobs can get stalled) is due to context 
switching at the kernel level, however (apart from educating users) how can we 
minimise that switching on the serial nodes?
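
As a point of reference, this is how I check what an individual job is actually 
constrained to on one of the shared nodes (the uid and job id are placeholders; 
the paths follow the usual cgroup v1 layout under our CgroupMountpoint):

cat /sys/fs/cgroup/cpuset/slurm/uid_12345/job_67890/cpuset.cpus
cat /sys/fs/cgroup/memory/slurm/uid_12345/job_67890/memory.limit_in_bytes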

Best regards,
David



Re: [slurm-users] Running job using our serial queue

2019-11-05 Thread David Baker
Hello,

Thank you for your replies. I double checked that the "task" in, for example, 
taskplugin=task/affinity is optional. In this respect it is good to know that 
we have  the correct cgroups setup. So in theory users should only disturb 
themselves, however in reality we find that there is often a knock on effect on 
other users' jobs. So, for example, users have complained that their jobs 
sometimes stall. I can only vaguely think that something odd is going on at the 
kernel level perhaps.

One additional thing that I need to ask is... Should we have hwloc installed on 
our compute nodes? Does that help? Whenever I check which processes are not 
being constrained by cgroups I only ever find a small group of system processes.

Best regards,
David





From: slurm-users  on behalf of Marcus 
Wagner 
Sent: 05 November 2019 07:47
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Running job using our serial queue

Hi David,

doing it the way you do it, is the same way, we do it.

When the Matlab job asks for one CPU, it only gets one CPU this way. That means 
that all the processes are bound to this one CPU. So (theoretically) the user 
is just disturbing himself if he uses more.

But especially with Matlab, there are more things to do. It does not suffice to add 
'-singleCompThread' to the command line. Matlab is not the only tool that tries 
to use all the cores it finds on the node.
The same is valid for CPLEX and Gurobi, both often used from Matlab. So even 
if the user sets '-singleCompThread' for Matlab, that does not mean at all that the 
job is only using one CPU.
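
One thing that can help as a first step is to make the thread count explicit in 
the job script, e.g. (just a sketch; the script name is a placeholder, and 
solvers such as CPLEX or Gurobi also have their own thread parameters that need 
capping):

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
matlab -nodisplay -singleCompThread -r "my_script; exit"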


Best
Marcus

On 11/4/19 4:14 PM, David Baker wrote:
Hello,

We decided to route all jobs requesting from 1 to 20 cores to our serial queue. 
Furthermore, the nodes controlled by the serial queue are shared by multiple 
users. We did this to try to reduce the level of fragmentation across the 
cluster -- our default "batch" queue provides exclusive access to compute nodes.

It looks like the downside of the serial queue is that jobs from different 
users can interact quite badly. To some extent this is an education issue -- 
for example matlab users need to be told to add the "-singleCompThread" option 
to their command line. On the other hand I wonder if our cgroups setup is 
optimal for the serial queue. Our cgroup.conf contains...

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup

The relevant cgroup configuration in the slurm.conf is...
ProctrackType=proctrack/cgroup
TaskPlugin=affinity,cgroup

Could someone please advise us on the required/recommended cgroup setup for the 
above scenario? For example, should we really set "TaskAffinity=yes"? I assume 
the interaction between jobs (sometimes jobs can get stalled) is due to context 
switching at the kernel level, however (apart from educating users) how can we 
minimise that switching on the serial nodes?

Best regards,
David



--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de



Re: [slurm-users] Running job using our serial queue

2019-11-07 Thread David Baker
Hi Marcus,

Thank you for your reply. Your comments regarding the oom_killer sound 
interesting. Looking at the slurmd logs on the serial nodes I see that the 
oom_killer is very active on a typical day, and so I suspect you're likely on 
to something there. As you might expect, memory is configured as a resource on 
these shared nodes and users should take care to request sufficient memory for 
their jobs. More often than not, I guess users are wrongly assuming that 
the default memory allocation is sufficient.
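
As a quick check I can compare what a job actually used against what it 
requested after the fact (the job id is a placeholder):

sacct -j 1234567 --format=JobID,State,ExitCode,ReqMem,MaxRSS,Elapsed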

Best regards,
David

From: Marcus Wagner 
Sent: 06 November 2019 09:53
To: David Baker ; slurm-users@lists.schedmd.com 
; juergen.s...@uni-ulm.de 

Subject: Re: [slurm-users] Running job using our serial queue

Hi David,

if I remember right (we have disabled swap for years now), swapping out 
processes seems to slow down the system overall.
But I know that if the oom_killer does its job (killing over-memory 
processes), the whole system is stalled until it has done its work. This might 
be the issue your users see.

Hwloc at least should help the scheduler decide where to place processes, 
but if I remember right, slurm has to be built with hwloc support (meaning at 
least hwloc-devel has to be installed).
But this part is more guessing than knowing.

Best
Marcus

On 11/5/19 11:58 AM, David Baker wrote:
Hello,

Thank you for your replies. I double checked that the "task" in, for example, 
taskplugin=task/affinity is optional. In this respect it is good to know that 
we have  the correct cgroups setup. So in theory users should only disturb 
themselves, however in reality we find that there is often a knock on effect on 
other users' jobs. So, for example, users have complained that their jobs 
sometimes stall. I can only vaguely think that something odd is going on at the 
kernel level perhaps.

One additional thing that I need to ask is... Should we have hwloc installed 
our compute nodes? Does that help? Whenever I check which processes are not 
being constrained by cgroups I only ever find a small group of system processes.

Best regards,
David





From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Marcus 
Wagner <wag...@itc.rwth-aachen.de>
Sent: 05 November 2019 07:47
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Running job using our serial queue

Hi David,

doing it the way you do it, is the same way, we do it.

When the Matlab job asks for one CPU, it only gets one CPU this way. That means 
that all the processes are bound to this one CPU. So (theoretically) the user 
is just disturbing himself if he uses more.

But especially with Matlab, there are more things to do. It does not suffice to add 
'-singleCompThread' to the command line. Matlab is not the only tool that tries 
to use all the cores it finds on the node.
The same is valid for CPLEX and Gurobi, both often used from Matlab. So even 
if the user sets '-singleCompThread' for Matlab, that does not mean at all that the 
job is only using one CPU.


Best
Marcus

On 11/4/19 4:14 PM, David Baker wrote:
Hello,

We decided to route all jobs requesting from 1 to 20 cores to our serial queue. 
Furthermore, the nodes controlled by the serial queue are shared by multiple 
users. We did this to try to reduce the level of fragmentation across the 
cluster -- our default "batch" queue provides exclusive access to compute nodes.

It looks like the downside of the serial queue is that jobs from different 
users can interact quite badly. To some extent this is an education issue -- 
for example matlab users need to be told to add the "-singleCompThread" option 
to their command line. On the other hand I wonder if our cgroups setup is 
optimal for the serial queue. Our cgroup.conf contains...

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup

The relevant cgroup configuration in the slurm.conf is...
ProctrackType=proctrack/cgroup
TaskPlugin=affinity,cgroup

Could someone please advise us on the required/recommended cgroup setup for the 
above scenario? For example, should we really set "TaskAffinity=yes"? I assume 
the interaction between jobs (sometimes jobs can get stalled) is due to context 
switching at the kernel level, however (apart from educating users) how can we 
minimise that switching on the serial nodes?

Best regards,
David



--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

[slurm-users] oom-kill events for no good reason

2019-11-07 Thread David Baker
Hello,

We are dealing with some weird issue on our shared nodes where jobs appear to be 
stalling for some reason. I was advised that this issue might be related to the 
oom-killer process. We do see a lot of these events. In fact when I started to 
take a closer look this afternoon I noticed that all jobs on all nodes (not 
just the shared nodes) are "firing" the oom-killer for some reason when they 
finish.

As a demo I launched a very simple (low memory usage) test job on a shared 
node and then after a few minutes cancelled it to show the behaviour. Looking 
in the slurmd.log -- see below -- we see the oom-killer being fired for no good 
reason. This "feels" vaguely similar to this bug -- 
https://bugs.schedmd.com/show_bug.cgi?id=5121 which I understand was patched 
back in SLURM v17 (we are using v18*).

Has anyone else seen this behaviour? Or more to the point does anyone 
understand this behaviour and know how to squash it, please?

Best regards,
David

[2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337
[2019-11-07T16:14:52.559] [164977.batch] task/cgroup: 
/slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB 
memsw.limit=unlimited
[2019-11-07T16:14:52.560] [164977.batch] task/cgroup: 
/slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB 
memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: 
/slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB 
memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: 
/slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB 
memsw.limit=unlimited
[2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using 
sched_affinity for tasks
[2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using 
sched_affinity for tasks
[2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 
CANCELLED AT 2019-11-07T16:16:05 ***
[2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event 
count: 1
[2019-11-07T16:16:05.886] [164977.extern] done with job


Re: [slurm-users] oom-kill events for no good reason

2019-11-12 Thread David Baker
Hello,

Thank you all for your useful replies. I double checked that the oom-killer 
"fires" at the end of every job on our cluster. As you mention this isn't 
significant and not something to be concerned about.

Best regards,
David

From: slurm-users  on behalf of Marcus 
Wagner 
Sent: 08 November 2019 13:00
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] oom-kill events for no good reason

Hi David,

yes, I see these messages also. I also think this is more likely a misleading 
message. If a job has been cancelled by the OOM-Killer, you can see this with 
sacct, e.g.
$> sacct -j 10816098
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10816098       VASP_MPI       c18m    default         12 OUT_OF_ME+    0:125
10816098.ba+      batch               default         12 OUT_OF_ME+    0:125
10816098.ex+     extern               default         12  COMPLETED      0:0
10816098.0     vasp_mpi               default         12 OUT_OF_ME+    0:125

Best
Marcus

On 11/7/19 5:36 PM, David Baker wrote:
Hello,

We are dealing with some weird issue on our shared nodes where job appear to be 
stalling for some reason. I was advised that this issue might be related to the 
oom-killer process. We do see a lot of these events. In fact when I started to 
take a closer look this afternoon I noticed that all jobs on all nodes (not 
just the shared nodes) are "firing" the oom-killer for some reason when they 
finish.

As a demo I launched a very simple (low memory usage) test jobs on a shared 
node  and then after a few minutes cancelled it to show the behaviour. Looking 
in the slurmd.log -- see below -- we see the oom-killer being fired for no good 
reason. This "feels" vaguely similar to this bug -- 
https://bugs.schedmd.com/show_bug.cgi?id=5121
 which I understand was patched back in SLURM v17 (we are using v18*).

Has anyone else seen this behaviour? Or more to the point does anyone 
understand this behaviour and know how to squash it, please?

Best regards,
David

[2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337
[2019-11-07T16:14:52.559] [164977.batch] task/cgroup: 
/slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB 
memsw.limit=unlimited
[2019-11-07T16:14:52.560] [164977.batch] task/cgroup: 
/slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB 
memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: 
/slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB 
memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: 
/slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB 
memsw.limit=unlimited
[2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using 
sched_affinity for tasks
[2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using 
sched_affinity for tasks
[2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 
CANCELLED AT 2019-11-07T16:16:05 ***
[2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event 
count: 1
[2019-11-07T16:16:05.886] [164977.extern] done with job


--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de



[slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread David Baker
Hello,

Our SLURM cluster is relatively small. We have 350 standard compute nodes each 
with 40 cores. The largest job that users  can run on the partition is one 
requesting 32 nodes. Our cluster is a general university research resource and 
so there are many different sizes of jobs ranging from single core jobs, that 
get routed to a serial partition via the job-submit.lua, through to jobs 
requesting 32 nodes. When we first started the service, 32 node jobs were 
typically taking in the region of 2 days to schedule -- recently queuing times 
have started to get out of hand. Our setup is essentially...

PriorityFavorSmall=NO
FairShareDampeningFactor=5
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0

PriorityWeightAge=40
PriorityWeightPartition=1000
PriorityWeightJobSize=50
PriorityWeightQOS=100
PriorityMaxAge=7-0

To try to reduce the queuing times for our bigger jobs should we potentially 
increase the PriorityWeightJobSize factor in the first instance to bump up the 
priority of such jobs? Or should we potentially define a set of QOSs which we 
assign to jobs in our job_submit.lua depending on the size of the job? In other 
words, let's say there is a "large" QOS that gives the largest jobs a higher 
priority and also limits how many of those jobs a single user can submit?
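
(For example, the kind of thing we have in mind -- with purely illustrative 
names and limits -- would be along these lines:

sacctmgr add qos large
sacctmgr modify qos large set Priority=10000 MaxSubmitJobsPerUser=4

with job_submit.lua then setting job_desc.qos = "large" for jobs above some 
node-count threshold.)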

Your advice would be appreciated, please. At the moment these large jobs are 
not accruing a sufficiently high priority to rise above the other jobs in the 
cluster.

Best regards,
David


Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread David Baker
Hello,

Thank you for your reply. In answer to Mike's questions...

Our serial partition nodes are partially shared by the high memory partition. 
That is, the partitions overlap partially -- shared nodes move one way or 
another depending upon demand. Jobs requesting up to and including 20 cores are 
routed to the serial queue. The serial nodes are shared resources. In other 
words, jobs from different users can share the nodes. The maximum time for 
serial jobs is 60 hours.

Over time there hasn't been any particular change in the times that users are 
requesting. Likewise I'm convinced that the overall job size spread is the same 
over time. What has changed is the increase in the number of smaller jobs. That 
is, one node jobs that are exclusive (can't be routed to the serial queue) or 
that require more than 20 cores, and also jobs requesting up to 10/15 nodes 
(let's say). The user base has increased dramatically over the last 6 months or 
so.

This overpopulation is leading to the delay in scheduling the larger jobs. 
Given the size of the cluster we may need to make decisions regarding which 
types of jobs we allow to "dominate" the system: the larger jobs at the expense 
of the small fry, for example. However, that is a difficult decision that means 
that someone has got to wait longer for their results.

Best regards,
David

From: slurm-users  on behalf of Renfro, 
Michael 
Sent: 31 January 2020 13:27
To: Slurm User Community List 
Subject: Re: [slurm-users] Longer queuing times for larger jobs

Greetings, fellow general university resource administrator.

Couple things come to mind from my experience:

1) does your serial partition share nodes with the other non-serial partitions?

2) what’s your maximum job time allowed, for serial (if the previous answer was 
“yes”) and non-serial partitions? Are your users submitting particularly longer 
jobs compared to earlier?

3) are you using the backfill scheduler at all?

--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
931 372-3601  / Tennessee Tech University

On Jan 31, 2020, at 6:23 AM, David Baker  wrote:

Hello,

Our SLURM cluster is relatively small. We have 350 standard compute nodes each 
with 40 cores. The largest job that users  can run on the partition is one 
requesting 32 nodes. Our cluster is a general university research resource and 
so there are many different sizes of jobs ranging from single core jobs, that 
get routed to a serial partition via the job-submit.lua, through to jobs 
requesting 32 nodes. When we first started the service, 32 node jobs were 
typically taking in the region of 2 days to schedule -- recently queuing times 
have started to get out of hand. Our setup is essentially...

PriorityFavorSmall=NO
FairShareDampeningFactor=5
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0

PriorityWeightAge=40
PriorityWeightPartition=1000
PriorityWeightJobSize=50
PriorityWeightQOS=100
PriorityMaxAge=7-0

To try to reduce the queuing times for our bigger jobs should we potentially 
increase the PriorityWeightJobSize factor in the first instance to bump up the 
priority of such jobs? Or should we potentially define a set of QOSs which we 
assign to jobs in our job_submit.lua depending on the size of the job. In other 
words, let's say there is large QOS that give the largest jobs a higher 
priority, and also limits how many of those jobs that a single user can submit?

Your advice would be appreciated, please. At the moment these large jobs are 
not accruing a sufficiently high priority to rise above the other jobs in the 
cluster.

Best regards,
David


Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread David Baker
Hello,

Thank you for your detailed reply. That's all very useful. I managed to mistype 
our cluster size since there are actually 450 standard 40-core compute nodes. 
What you say is interesting, and so it concerns me that things are so bad at 
the moment.

I wondered if you could please give me some more details of how you use TRES to 
throttle user activity. We have applied some limits to throttle users, however 
perhaps not enough or not well enough. So the details of what you do would be 
really appreciated, please.

In addition, we do use backfill, however we rarely see nodes being freed up in 
the cluster to make way for high priority work which again concerns me. If you 
could please share your backfill configuration then that would be appreciated, 
please.

Finally, which version of Slurm are you running? We are using an early release 
of v18.

Best regards,
David


From: slurm-users  on behalf of Renfro, 
Michael 
Sent: 31 January 2020 17:23:05
To: Slurm User Community List 
Subject: Re: [slurm-users] Longer queuing times for larger jobs

I missed reading what size your cluster was at first, but found it on a second 
read. Our cluster and typical maximum job size scales about the same way, 
though (our users’ typical job size is anywhere from a few cores up to 10% of 
our core count).

There are several recommendations to separate your priority weights by an order 
of magnitude or so. Our weights are dominated by fairshare, and we effectively 
ignore all other factors.

We also put TRES limits on by default, so that users can’t queue-stuff beyond a 
certain limit (any jobs totaling under around 1 cluster-day can be in a running 
or queued state, and anything past that is ignored until their running jobs 
burn off some of their time). This allows other users’ jobs to have a chance to 
run if resources are available, even if they were submitted well after the 
heavy users’ blocked jobs.

We also make extensive use of the backfill scheduler to run small, short jobs 
earlier than their queue time might allow, if and only if they don’t delay 
other jobs. If a particularly large job is about to run, we can see the nodes 
gradually empty out, which opens up lots of capacity for very short jobs.

Overall, our average wait times since September 2017 haven’t exceeded 90 hours 
for any job size, and I’m pretty sure a *lot* of that wait is due to a few 
heavy users submitting large numbers of jobs far beyond the TRES limit. Even 
our jobs of 5-10% cluster size have average start times of 60 hours or less 
(and we've managed under 48 hours for those size jobs for all but 2 months of 
that period), but those larger jobs tend to be run by our lighter users, and 
they get a major improvement to their queue time due to being far below their 
fairshare target.

We’ve been running at >50% capacity since May 2018, and >60% capacity since 
December 2018, and >80% capacity since February 2019. So our wait times aren’t 
due to having a ton of spare capacity for extended periods of time.

Not sure how much of that will help immediately, but it may give you some ideas.

> On Jan 31, 2020, at 10:14 AM, David Baker  wrote:
>
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> Hello,
>
> Thank you for your reply. in answer to Mike's questions...
>
> Our serial partition nodes are partially shared by the high memory partition. 
> That is, the partitions overlap partially -- shared nodes move one way or 
> another depending upon demand. Jobs requesting up to and including 20 cores 
> are routed to the serial queue. The serial nodes are shared resources. In 
> other words, jobs from different users can share the nodes. The maximum time 
> for serial jobs is 60 hours.
>
> Overtime there hasn't been any particular change in the time that users are 
> requesting. Likewise I'm convinced that the overall job size spread is the 
> same over time. What has changed is the increase in the number of smaller 
> jobs. That is, one node jobs that are exclusive (can't be routed to the 
> serial queue) or that require more then 20 cores, and also jobs requesting up 
> to 10/15 nodes (let's say). The user base has increased dramatically over the 
> last 6 months or so.
>
> This over population is leading to the delay in scheduling the larger jobs. 
> Given the size of the cluster we may need to make decisions regarding which 
> types of jobs we allow to "dominate" the system. The larger jobs at the 
> expense of the small fry for example, however that is a difficult decision 
> that means that someone has got to wait longer for results..
>
> Best regards,
> David
> From: slurm-users  on behalf of 
> Renfro, Michael 
> Sent: 31 Janu

Re: [slurm-users] Longer queuing times for larger jobs

2020-02-04 Thread David Baker
Hello,

Thank you very much again for your comments and the details of your slurm 
configuration. All the information is really useful. We are working on our 
cluster right now and making some appropriate changes. We'll see how we get on 
over the next 24 hours or so.

Best regards,
David

From: slurm-users  on behalf of Renfro, 
Michael 
Sent: 31 January 2020 22:08
To: Slurm User Community List 
Subject: Re: [slurm-users] Longer queuing times for larger jobs

Slurm 19.05 now, though all these settings were in effect on 17.02 until quite 
recently. If I get some detail wrong below, I hope someone will correct me. But 
this is our current working state. We’ve been able to schedule 10-20k jobs per 
month since late 2017, and we successfully scheduled 320k jobs over December 
and January (largely due to one user using some form of automated submission 
for very short jobs).

Basic scheduler setup:

As I’d said previously, we prioritize on fairshare almost exclusively. Most of 
our jobs (molecular dynamics, CFD) end up in a single batch partition, since 
GPU and big-memory jobs have other partitions.

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10
PriorityWeightAge=1000
PriorityWeightPartition=1
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

TRES limits:

We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set 
grptresrunmin=cpu=1440000 (1,440,000 CPU-minutes) — there might be a way of 
doing this at a higher accounting level, but it works as is.
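
(For completeness, the same kind of limit can in principle be attached to an 
account rather than to each user, in which case it caps the account's running 
usage as a whole; untested here, but it should look something like:

sacctmgr modify account someaccount set grptresrunmin=cpu=1440000

where "someaccount" is a placeholder.)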

We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and set 
MaxJobsPerUser equal to our total GPU count. That helps prevent users from 
queue-stuffing the GPUs even if they stay well below the 1000 CPU-day TRES 
limit above.

Backfill:

  SchedulerType=sched/backfill
  
SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200

Can’t remember where I found the backfill guidance, but:

- bf_window is set to our maximum job length (30 days) and bf_resolution is set 
to 1.5 days. Most of our users’ jobs are well over 1 day.
- We have had users who didn’t use job arrays, and submitted a ton of small 
jobs at once, thus bf_max_job_user gives the scheduler a chance to start up to 
80 jobs per user each cycle. This also prompted us to increase 
default_queue_depth, so the backfill scheduler would examine more jobs each 
cycle.
- bf_continue should let the backfill scheduler continue where it left off if 
it gets interrupted, instead of having to start from scratch each time.

I can guarantee you that our backfilling was sub-par until we tuned these 
parameters (or at least a few users could find a way to submit so many jobs 
that the backfill couldn’t keep up, even when we had idle resources for their 
very short jobs).
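
One way to keep an eye on whether backfill tuning like this is paying off is 
the "Backfilling stats" section of sdiag, which reports how many jobs backfill 
has started and how long its cycles take; something like:

sdiag | grep -A 20 "Backfilling stats"

(the grep window is just a convenience; sdiag prints the backfill statistics 
towards the end of its output).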

> On Jan 31, 2020, at 3:01 PM, David Baker  wrote:
>
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> Hello,
>
> Thank you for your detailed reply. That’s all very useful. I manage to 
> mistype our cluster size since there are actually 450 standard compute, 40 
> core, compute nodes. What you say is interesting and so it concerns me that 
> things are so bad at the moment,
>
> I wondered if you could please give me some more details of how you use TRES 
> to throttle user activity. We have applied some limits to throttle users, 
> however perhaps not enough or not well enough. So the details of what you do 
> would be really appreciated, please.
>
> In addition, we do use backfill, however we rarely see nodes being freed up 
> in the cluster to make way for high priority work which again concerns me. If 
> you could please share your backfill configuration then that would be 
> appreciated, please.
>
> Finally, which version of Slurm are you running? We are using an early 
> release of v18.
>
> Best regards,
> David
>
> From: slurm-users  on behalf of 
> Renfro, Michael 
> Sent: 31 January 2020 17:23:05
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Longer queuing times for larger jobs
>
> I missed reading what size your cluster was at first, but found it on a 
> second read. Our cluster and typical maximum job size scales about the same 
> way, though (our users’ typical job size is anywhere from a few cores up to 
> 10% of our core count).
>
> There are several recommendations to separate your priority weights by an 
> order of magnitude or so. Our weights are dominated by fairshare, and we 
> effectively ignore all other factors.
>
> We also put TRES limits on by default, so that users can’t queue-stuff beyond 
> a certain limit (any jobs totaling under around 1 cluster-

Re: [slurm-users] Longer queuing times for larger jobs

2020-02-04 Thread David Baker
Hello,

I've taken a very good look at our cluster, however as yet not made any 
significant changes. The one change that I did make was to increase the 
"jobsizeweight". That's now our dominant parameter and it does ensure that our 
largest jobs (> 20 nodes) are making it to the top of the sprio listing which 
is what we want to see.

These large jobs aren't making any progress despite the priority lift. I 
additionally decreased the nice value of the job that sparked this discussion. 
That is (looking at sprio), there is a 32 node job with a very high 
priority...

  JOBID PARTITION     USER  PRIORITY       AGE  FAIRSHARE   JOBSIZE  PARTITION       QOS      NICE
 280919     batch   mep1c10   1275481        40      59827    415655          0         0       -40

That job has been sitting in the queue for well over a week, and it is 
disconcerting that we never see nodes being kept idle in order to service these 
large jobs. Nodes do become idle, but they then get scooped up by jobs started 
via backfill. Looking at the slurmctld logs I see that the vast majority of jobs 
are being started via backfill -- including, for example, a 24 node job. I see 
very few jobs allocated by the main scheduler. That is, messages like "sched: 
Allocate JobId=296915" are few and far between, and I never see any of the large 
jobs being allocated in the batch queue.

Surely, this is not correct, however does anyone have any advice on what to 
check, please?
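
As a concrete starting point, the scheduler's own view of that 32 node job can 
be inspected with something like (job ID as in the sprio output above):

squeue --start -j 280919
scontrol show job 280919 | grep -E 'JobState|Reason|StartTime'

to see whether backfill has computed an expected start time for it at all.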

Best regards,
David

From: slurm-users  on behalf of Killian 
Murphy 
Sent: 04 February 2020 10:48
To: Slurm User Community List 
Subject: Re: [slurm-users] Longer queuing times for larger jobs

Hi David.

I'd love to hear back about the changes that you make and how they affect the 
performance of your scheduler.

Any chance you could let us know how things go?

Killian

On Tue, 4 Feb 2020 at 10:43, David Baker <d.j.ba...@soton.ac.uk> wrote:
Hello,

Thank you very much again for your comments and the details of your slurm 
configuration. All the information is really useful. We are working on our 
cluster right now and making some appropriate changes. We'll see how we get on 
over the next 24 hours or so.

Best regards,
David

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Renfro, Michael <ren...@tntech.edu>
Sent: 31 January 2020 22:08
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Longer queuing times for larger jobs

Slurm 19.05 now, though all these settings were in effect on 17.02 until quite 
recently. If I get some detail wrong below, I hope someone will correct me. But 
this is our current working state. We’ve been able to schedule 10-20k jobs per 
month since late 2017, and we successfully scheduled 320k jobs over December 
and January (largely due to one user using some form of automated submission 
for very short jobs).

Basic scheduler setup:

As I’d said previously, we prioritize on fairshare almost exclusively. Most of 
our jobs (molecular dynamics, CFD) end up in a single batch partition, since 
GPU and big-memory jobs have other partitions.

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10
PriorityWeightAge=1000
PriorityWeightPartition=1
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

TRES limits:

We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set 
grptresrunmin=cpu=144 — there might be a way of doing this at a higher 
accounting level, but it works as is.

We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and set 
MaxJobsPerUser equal to our total GPU count. That helps prevent users from 
queue-stuffing the GPUs even if they stay well below the 1000 CPU-day TRES 
limit above.

Backfill:

  SchedulerType=sched/backfill
  
SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200

Can’t remember where I found the backfill guidance, but:

- bf_window is set to our maximum job length (30 days) and bf_resolution is set 
to 1.5 days. Most of our users’ jobs are well over 1 day.
- We have had users who didn’t use job arrays, and submitted a ton of small 
jobs at once, thus bf_max_job_user gives the scheduler a chance to start up to 
80 jobs per user each cycle. This also prompted us to increase 
default_queue_depth, so the backfill scheduler would examine more jobs each 
cycle.
- bf_continue should let the backfill scheduler continue where it left off if 
it gets interrupted, instead of having to start from scratch each time.

I can guarantee you that our backfilling was sub-par until we tuned these 
parameters (or at least a few users could find a way to submit so many jobs 
that the backfill couldn’t keep up, even when we had 

[slurm-users] Advice on using GrpTRESRunMin=cpu=

2020-02-12 Thread David Baker
Hello,

Before implementing "GrpTRESRunMin=cpu=limit" on our production cluster I'm 
doing some tests on the development cluster. I've only got a handful of compute 
nodes to play with, and so I have set the limit sensibly low. That is, I've 
set the limit to be 576,000. That's equivalent to 400 CPU-days. In other words, 
I can potentially submit the following job...

1 x 2 nodes x 80 cpus/node x 2.5 days = 400 CPU-days

I submitted a set of jobs requesting 2 nodes, 80 cpus/node for 2.5 days. The 
first day is running and the rest are in the queue -- what I see makes sense...

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    677     debug    myjob     djb1 PD       0:00      2 (AssocGrpCPURunMinutesLimit)
    678     debug    myjob     djb1 PD       0:00      2 (AssocGrpCPURunMinutesLimit)
    679     debug    myjob     djb1 PD       0:00      2 (AssocGrpCPURunMinutesLimit)
    676     debug    myjob     djb1  R      12:52      2 navy[54-55]

On the other hand, I expected these jobs not to accrue priority, however they 
do appear to be doing so (see the sprio output below). I'm working with Slurm 
v19.05.2. Have I missed something vital/important in the config? We hoped that 
the queued jobs would not accrue priority. We haven't, for example, used the 
ACCRUE_ALWAYS flag. Have I got that wrong? Could someone please advise us.
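
(For reference, the flag I mean is the ACCRUE_ALWAYS option of PriorityFlags; 
whether it is set can be confirmed with something like:

scontrol show config | grep -i PriorityFlags

My reading of the documentation is that with ACCRUE_ALWAYS set, pending jobs 
gain age priority even while they are held back, whereas without it jobs 
blocked by limits should not start accruing until they become eligible -- but 
I may have that wrong, hence the question.)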

Best regards,
David

[root@navy51 slurm]# sprio
  JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE        QOS
    677     debug    5551643         10       1644         45        500          0
    678     debug    5551643         10       1644         45        500          0
    679     debug    5551642         10       1643         45        500          0


Re: [slurm-users] Job with srun is still RUNNING after node reboot

2020-03-31 Thread David Rhey
Hi, Yair,

Out of curiosity have you checked to see if this is a runaway job?

David

On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom  wrote:

> Hi,
>
> We have an issue where running srun (with --pty zsh), and rebooting the
> node (from a different shell), the srun reports:
> srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]:
> Zero Bytes were transmitted or received
> and hangs.
>
> After the node boots, the slurm claims that job is still RUNNING, and srun
> is still alive (but not responsive).
>
> I've tried it with various configurations (select/linear,
> select/cons_tres, jobacct_gather/linux, jobacct_gather/cgroup, task/none,
> task/cgroup), with the same results. We're using 19.05.1.
> Running with sbatch causes the job to be in the more appropriate NODE_FAIL
> state instead.
>
> Anyone else encountered this? or know how to make the job state not
> RUNNING after it's clearly not running?
>
> Thanks in advance,
> Yair.
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


Re: [slurm-users] Drain a single user's jobs

2020-04-01 Thread David Rhey
Hi Mark,

I *think* you might need to update the user account to have access to that
QoS (as part of their association). Using sacctmgr modify user  + some
additional args (they escape me at the moment).
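
(If it helps, I think the piece I was reaching for is the "+=" form, which adds 
a QoS to the user's existing list instead of replacing it -- from memory, so 
double-check the syntax:

sacctmgr modify user foo set qos+=drain defaultqos=drain

Replacing the whole QoS list is what tends to leave already-queued jobs with an 
InvalidQOS reason.)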

Also, you *might* have been able to set the MaxSubmitJobs at their account
level to 0 and have them run without having to do the QoS approach - but
that's just a guess on my end based on how we've done some things here. We
had a "free period" for our clusters and once it was over we set the
GrpSubmit jobs on an account to 0 which allowed in-flight jobs to continue
but no new work to be submitted.

HTH,

David

On Wed, Apr 1, 2020 at 5:57 AM Mark Dixon  wrote:

> Hi all,
>
> I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster.
>
> I'd like to stop user foo from submitting new jobs but allow their
> existing jobs to run.
>
> We have several partitions, each with its own qos and MaxSubmitJobs
> typically set to some vaue. These qos are stopping a "sacctmgr update user
> foo set maxsubmitjobs=0" from doing anything useful, as per the
> documentation.
>
> I've tried setting up a competing qos:
>
>sacctmgr add qos drain
>sacctmgr modify qos drain set MaxSubmitJobs=0
>sacctmgr modify qos drain set flags=OverPartQOS
>sacctmgr modify user foo set qos=drain
>
> This has successfully prevented the user from submitting new jobs, but
> their existing jobs aren't running. I'm seeing the reason code
> "InvalidQOS".
>
> Any ideas what I should be looking at, please?
>
> Thanks,
>
> Mark
>
>

-- 
David Rhey
---
Advanced Research Computing - Technology Services
University of Michigan


[slurm-users] Slurm unlink error messages -- what do they mean?

2020-04-23 Thread David Baker
Hello,

We have, rather belatedly, just upgraded to Slurm v19.05.5. On the whole, so 
far so good -- no major problems. One user has complained that his job now 
crashes and reports an unlink error. That is..


slurmstepd: error: get_exit_code task 0 died by signal: 9
slurmstepd: error: unlink(/tmp/slurmd/job392987/slurm_script): No such file or 
directory

I suspect that this message has something to do with the completion of one of 
the steps in his job. Apparently his job is quite complex with a number of 
inter-related tasks.

Significantly, we decided to switch from an rpm to a 'build from source' 
installation. In other words, we did have rpms on each node in the cluster, but 
now have slurm installed on a global file system. Does anyone have any thoughts 
regarding the above issue, please? I'm still to see the user's script and so 
there might be a good logical explanation for the message on inspection.

Best regards,
David


Re: [slurm-users] Job Step Resource Requests are Ignored

2020-05-06 Thread David Braun
i'm not sure I understand the problem.  If you want to make sure the
preamble and postamble run even if the main job doesn't run you can use '-d'

from the man page

-d, --dependency=<dependency_list>
      Defer the start of this job until the specified dependencies have been
      satisfied completed.  <dependency_list> is of the form
      <type:job_id[:job_id][,type:job_id[:job_id]]> or
      <type:job_id[:job_id][?type:job_id[:job_id]]>.  All dependencies must
      be satisfied if the "," separator is used.  Any dependency may be
      satisfied if the "?" separator is used.  Many jobs can share the same
      dependency and these jobs may even belong to different users. The value
      may be changed after job submission using the scontrol command.  Once a
      job dependency fails due to the termination state of a preceding job,
      the dependent job will never be run, even if the preceding job is
      requeued and has a different termination state in a subsequent
      execution.


for instance, create a job that contains this:

# --parsable makes sbatch print just the job ID, so it can be captured directly
preamble_id=$(sbatch --parsable preamble.job)
main_id=$(sbatch --parsable -d afterok:$preamble_id main.job)
sbatch -d afterany:$main_id postamble.job

Best,

D

On Wed, May 6, 2020 at 2:19 PM Maria Semple  wrote:

> Hi Chris,
>
> I think my question isn't quite clear, but I'm also pretty confident the
> answer is no at this point. The idea is that the script is sort of like a
> template for running a job, and an end user can submit a custom job with
> their own desired resource requests which will end up filling in the
> template. I'm not in control of the Slurm cluster that will ultimately run
> the job, nor the details of the job itself. For example, template-job.sh
> might look like this:
>
> #!/bin/bash
> srun -c 1 --mem=1k echo "Preamble"
> srun -c <cpus> --mem=<mem>m /bin/sh -c <command>
> srun -c 1 --mem=1k echo "Postamble"
>
> My goal is that even if the user requests 10 CPUs when the cluster only
> has 4 available, the Preamble and Postamble steps will always run. But as I
> said, it seems like that's not possible since the maximum number of CPUs
> needs to be set on the sbatch allocation and the whole job would be
> rejected on the basis that too many CPUs were requested. Is that correct?
>
> On Tue, May 5, 2020, 11:13 PM Chris Samuel  wrote:
>
>> On Tuesday, 5 May 2020 11:00:27 PM PDT Maria Semple wrote:
>>
>> > Is there no way to achieve what I want then? I'd like the first and
>> last job
>> > steps to always be able to run, even if the second step needs too many
>> > resources (based on the cluster).
>>
>> That should just work.
>>
>> #!/bin/bash
>> #SBATCH -c 2
>> #SBATCH -n 1
>>
>> srun -c 1 echo hello
>> srun -c 4 echo big wide
>> srun -c 1 echo world
>>
>> gives:
>>
>> hello
>> srun: Job step's --cpus-per-task value exceeds that of job (4 > 2). Job
>> step
>> may never run.
>> srun: error: Unable to create step for job 604659: More processors
>> requested
>> than permitted
>> world
>>
>> > As a side note, do you know why it's not even possible to restrict the
>> > number of resources a single step uses (i.e. set less CPUs than are
>> > available to the full job)?
>>
>> My suspicion is that you've not set up Slurm to use cgroups to restrict
>> the
>> resources a job can use to just those requested.
>>
>> https://slurm.schedmd.com/cgroups.html
>>
>> All the best,
>> Chris
>> --
>>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>>
>>
>>
>>
>>


Re: [slurm-users] How to view GPU indices of the completed jobs?

2020-06-10 Thread David Braun
Hi Kota,

This is from the job template that I give to my users:

# Collect some information about the execution environment that may
# be useful should we need to do some debugging.

echo "CREATING DEBUG DIRECTORY"
echo

mkdir .debug_info
module list > .debug_info/environ_modules 2>&1
ulimit -a > .debug_info/limits 2>&1
hostname > .debug_info/environ_hostname 2>&1
env |grep SLURM > .debug_info/environ_slurm 2>&1
env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1
env |grep OMPI > .debug_info/environ_openmpi 2>&1
env > .debug_info/environ 2>&1

if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
echo "SAVING CUDA ENVIRONMENT"
echo
env |grep CUDA > .debug_info/environ_cuda 2>&1
fi

You could add something like this to one of the SLURM prologs to save the
GPU list of jobs.
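
For example, a minimal prolog sketch (the log path is arbitrary, and this 
assumes the SLURM_JOB_GPUS variable is present in the prolog environment on 
your version):

#!/bin/bash
# Prolog fragment: record which GPU indices each job was allocated
if [ -n "$SLURM_JOB_GPUS" ]; then
    echo "$(date +%FT%T) job=$SLURM_JOB_ID user=$SLURM_JOB_USER gpus=$SLURM_JOB_GPUS" \
        >> /var/log/slurm/gpu_allocations.log
fi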

Best,

David

On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki <
kota.tsuyuzaki...@hco.ntt.co.jp> wrote:

> Hello Guys,
>
> We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series)
> and some of GPUs seemed to get troubles for attached
> jobs. To investigate if the troubles happened on the same GPUs, I'd like
> to get GPU indices of the completed jobs.
>
> In my understanding `scontrol show job` can show the indices (as IDX in
> gres info) but cannot be used for completed job. And also
> `sacct -j` is available for complete jobs but won't print the indices.
>
> Is there any way (commands, configurations, etc...) to see the allocated
> GPU indices for completed jobs?
>
> Best regards,
>
> 
> 露崎 浩太 (Kota Tsuyuzaki)
> kota.tsuyuzaki...@hco.ntt.co.jp
> NTT Software Innovation Center
> Distributed Processing Platform Technology Project
> 0422-59-2837
> -
>
>
>
>
>
>


[slurm-users] Nodes do not return to service after scontrol reboot

2020-06-16 Thread David Baker
Hello,

We are running Slurm v19.05.5 and I am experimenting with the scontrol reboot 
command. I find that compute nodes reboot, but they are not returned to 
service. Rather they remain down following the reboot..

navy55      1   debug*   down   80   2:20:2   192000       0   2000   (null) Reboot ASAP : reboot

This is a diskfull node and so it doesn't take too long to reboot. For the sake 
of the argument I have set ResumeTimeOut to 1000 seconds which is well over 
what's needed...

[root@navy51 slurm]# grep -i resume slurm.conf
ResumeTimeout=1000
[root@navy51 slurm]# grep -i return slurm.conf
ReturnToService=0
[root@navy51 slurm]# grep -i nhc slurm.conf
# LBNL Node Health Check (NHC)
#HealthCheckProgram=/usr/sbin/nhc

For this experiment I have disabled the health checker, and I don't think 
setting ReturnToService=1 helps. Could anyone please help with this? We are 
about to update the node firmware and ensuring that the nodes are returned to 
service following their reboot would be useful.

Best regards,
David


Re: [slurm-users] Nodes do not return to service after scontrol reboot

2020-06-17 Thread David Baker
Hello Chris,

Thank you for your comments. The scontrol reboot command is now working as 
expected.

Best regards,
David


From: slurm-users  on behalf of 
Christopher Samuel 
Sent: 16 June 2020 18:16
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Nodes do not return to service after scontrol reboot

On 6/16/20 8:16 am, David Baker wrote:

> We are running Slurm v19.05.5 and I am experimenting with the *scontrol
> reboot * command. I find that compute nodes reboot, but they are not
> returned to service. Rather they remain down following the reboot..

How are you using "scontrol reboot" ?

We do:

scontrol reboot ASAP nextstate=resume reason=$REASON $NODE

Which works for us (and we have health checks in our epilog that can
trigger this for known issues like running low on unfragmented huge pages).

All the best,
Chris
--
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



[slurm-users] Slurm and shared file systems

2020-06-19 Thread David Baker
Hello,

We are currently helping a research group to set up their own Slurm cluster. 
They have asked a very interesting question about Slurm and file systems. That 
is, they are posing the question -- do you need a shared user file store on a 
Slurm cluster?

So, in the extreme case where there is no shared file store for users, can Slurm 
operate properly across a cluster? I have seen commands like sbcast to move a 
file from the submission node to a compute node, however that command can only 
transfer one file at a time. Furthermore, what would happen to the standard 
output files? I'm going to guess that there must be a shared file system, 
however it would be good if someone could please confirm this.
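
(To make the sbcast point concrete, what I mean is staging files to node-local 
storage inside the batch script, along these lines -- file names are only 
illustrative:

#!/bin/bash
#SBATCH --nodes=4
# stage the binary and its input to /tmp on every node in the allocation
sbcast my_app /tmp/my_app
sbcast input.dat /tmp/input.dat
srun /tmp/my_app /tmp/input.dat

which still leaves the question of where the job script itself and the 
standard output end up without a shared file system.)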

Best regards,
David




[slurm-users] Slurm -- using GPU cards with NVLINK

2020-09-10 Thread David Baker
Hello,

We are installing a group of nodes which all contain 4 GPU cards. The GPUs are 
paired together using NVLINK as described in the matrix below.

We are familiar with using Slurm to schedule and run jobs on GPU cards, but 
this is the first time we have dealt with NVLINK enabled GPUs. Could someone 
please advise us how to configure Slurm so that we can submit jobs to the cards 
and make use of the NVLINK? That is, what do we need to put in the gres.conf or 
slurm.conf, and how should users use the sbatch command? I presume, for 
example, that a user could make use of a GPU card, and potentially make use of 
memory on the paired card.

Best regards,
David

[root@alpha51 ~]# nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV2     SYS     SYS     0,2,4,6,8,10    0
GPU1    NV2      X      SYS     SYS     0,2,4,6,8,10    0
GPU2    SYS     SYS      X      NV2     1,3,5,7,9,11    1
GPU3    SYS     SYS     NV2      X      1,3,5,7,9,11    1


Re: [slurm-users] Slurm -- using GPU cards with NVLINK

2020-09-11 Thread David Baker
Hi Ryan,

Thank you very much for your reply. That is useful. We'll see how we get on.

Best regards,
David

From: slurm-users  on behalf of Ryan 
Novosielski 
Sent: 11 September 2020 00:08
To: Slurm User Community List 
Subject: Re: [slurm-users] Slurm -- using GPU cards with NVLINK

I’m fairly sure that you set this up the same way you set up for a peer-to-peer 
setup. Here’s ours:

[root@cuda001 ~]# nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0 X  PIX SYS SYS PHB 0-11
GPU1PIX  X  SYS SYS PHB 0-11
GPU2SYS SYS  X  PIX SYS 12-23
GPU3SYS SYS PIX  X  SYS 12-23
mlx4_0  PHB PHB SYS SYS  X

[root@cuda001 ~]# cat /etc/slurm/gres.conf

…

# 2 x K80 (perceval)
NodeName=cuda[001-008] Name=gpu File=/dev/nvidia[0-1] CPUs=0-11
NodeName=cuda[001-008] Name=gpu File=/dev/nvidia[2-3] CPUs=12-23
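
For the NVLINK side specifically, recent Slurm releases can also record the 
links between GPUs in gres.conf. I believe the simplest route -- assuming Slurm 
was built with NVML support -- is to let it autodetect the topology, e.g.:

# gres.conf (sketch)
AutoDetect=nvml

in which case the File=, Cores= and Links= entries are filled in from NVML 
rather than written by hand.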

This also seems to be related:

https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf

--

|| \\UTGERS,  |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Sep 10, 2020, at 11:00 AM, David Baker  wrote:
>
> Hello,
>
> We are installing a group of nodes which all contain 4 GPU cards. The GPUs 
> are paired together using NVLINK as described in the matrix below.
>
> We are familiar with using Slurm to schedule and run jobs on GPU cards, but 
> this is the first time we have dealt with NVLINK enabled GPUs. Could someone 
> please advise us how to configure Slurm so that we can submit jobs to the 
> cards and make use of the NVLINK? That is, what do we need to put in the 
> gres.conf or slurm.conf, and how should users use the sbatch command? I 
> presume, for example, that a user could make use of a GPU card, and 
> potentially make use of memory on the paired card.
>
> Best regards,
> David
>
> [root@alpha51 ~]# nvidia-smi topo --matrix
>         GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
> GPU0     X      NV2     SYS     SYS     0,2,4,6,8,10    0
> GPU1    NV2      X      SYS     SYS     0,2,4,6,8,10    0
> GPU2    SYS     SYS      X      NV2     1,3,5,7,9,11    1
> GPU3    SYS     SYS     NV2      X      1,3,5,7,9,11    1



[slurm-users] Accounts and QOS settings

2020-10-01 Thread David Baker
Hello,

I wondered if someone would be able to advise me on how to limit access to a 
group of resources, please.

We have just installed a set of 6 GPU nodes. These nodes belong to a research 
department and both staff and students will potentially need access to the 
nodes. I need to ensure that only these two groups of users have access to the 
nodes. The general public should not have access to the resources. Access to 
the nodes is a 50/50 split between the two groups, and staff should be able to 
run much longer jobs than students. Those are the constraints.

How can I do the above? I assume I put the users into two account groups -- 
staff and students, for example. Then I could use the groups to limit access to 
the partition. How do I best use a QOS to limit the number of nodes used per 
group and the walltime allowed? Should/can I apply a QOS to the account group, 
or to the partition? My thought was to have two overlapping partitions, each with 
the relevant QOS and account group access control. Perhaps I am making this too 
complicated. I would appreciate your advice, please.
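
(To make that concrete, the sort of sketch I have in mind -- all names, node 
lists and walltimes below are purely illustrative -- is:

sacctmgr add account staff
sacctmgr add account students
sacctmgr add qos staffgpu
sacctmgr modify qos staffgpu set GrpTRES=node=3 MaxWall=5-00:00:00
sacctmgr add qos studentgpu
sacctmgr modify qos studentgpu set GrpTRES=node=3 MaxWall=1-00:00:00

PartitionName=gpustaff   Nodes=gpu[001-006] AllowAccounts=staff    QOS=staffgpu
PartitionName=gpustudent Nodes=gpu[001-006] AllowAccounts=students QOS=studentgpu

i.e. two overlapping partitions over the same six nodes, each tied to one 
account and one QOS.)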

Best regards,
David


[slurm-users] Controlling access to idle nodes

2020-10-06 Thread David Baker
Hello,

I would appreciate your advice on how to deal with this situation in Slurm, 
please. I have a set of nodes used by 2 groups, and normally each group 
would have access to half the nodes. So, I could limit each group to have 
access to 3 nodes each, for example. I am trying to devise a scheme that allows 
each group to make the best use of the nodes at all times. In other words, each 
group could potentially use all the nodes (assuming they are all free and the 
other group isn't using the nodes at all).

I cannot set hard and soft limits in slurm, and so I'm not sure how to make the 
situation flexible. Ideally It would be good for each group to be able to use 
their allocation and then take advantage of any idle nodes via a scavenging 
mechanism. The other group could then pre-empt the scavenger jobs and claim 
their nodes. I'm struggling with this since this seems like a two-way scavenger 
situation.

Could anyone please help? I have, by the way, set up partition-based 
pre-emption in the cluster. This allows the general public to scavenge nodes 
owned by research groups.

Best regards,
David




[slurm-users] unable to run on all the logical cores

2020-10-07 Thread David Bellot
Hi,

my Slurm cluster has a dozen machines configured as follows:

NodeName=foobar01 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
ThreadsPerCore=2 RealMemory=257243 State=UNKNOWN

and scheduling is:

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

My problem is that only half of the logical cores are used when I run a
computation.

Let me explain: I use R and the package 'batchtools' to create jobs. All
the jobs are created under the hood with sbatch. If I log in to all the
machines in my cluster and do a 'htop', I can see that only half of the
logical cores are used. Other methods to measure the load of each machine
confirmed this "visual" clue.
My jobs ask Slurm for only one cpu per task. I tried to enforce that with
the -c 1 but it didn't make any difference.

Then I realized there was something strange:
when I do scontrol show job , I can spot the following output:

   NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:2 CoreSpec=*

that is each job uses NumCPUs=2 instead of 1. Also, I'm not sure why
TRES=cpu=2

Any idea on how to solve this problem and have 100% of the logical cores
allocated?

Best regards,
David


Re: [slurm-users] unable to run on all the logical cores

2020-10-07 Thread David Bellot
Hi Rodrigo,

good spot. At least, scontrol show job is now saying that each job only
requires one "CPU", so it seems all the cores are treated the same way now.
Though I still have the problem of not using more than half the cores. So I
suppose it might be due to the way I submit (batchtools in this case) the
jobs.
I'm still investigating even if NumCPUs=1 now as it should be. Thanks.

David

On Thu, Oct 8, 2020 at 4:40 PM Rodrigo Santibáñez <
rsantibanez.uch...@gmail.com> wrote:

> Hi David,
>
> I had the same problem time ago when configuring my first server.
>
> Could you try SelectTypeParameters=CR_CPU instead of
> SelectTypeParameters=CR_Core?
>
> Best regards,
> Rodrigo.
>
> On Thu, Oct 8, 2020, 02:16 David Bellot 
> wrote:
>
>> Hi,
>>
>> my Slurm cluster has a dozen machines configured as follows:
>>
>> NodeName=foobar01 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
>> ThreadsPerCore=2 RealMemory=257243 State=UNKNOWN
>>
>> and scheduling is:
>>
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core
>>
>> My problem is that only half of the logical cores are used when I run a
>> computation.
>>
>> Let me explain: I use R and the package 'batchtools' to create jobs. All
>> the jobs are created under the hood with sbatch. If I log in to all the
>> machines in my cluster and do a 'htop', I can see that only half of the
>> logical cores are used. Other methods to measure the load of each machine
>> confirmed this "visual" clue.
>> My jobs ask Slurm for only one cpu per task. I tried to enforce that with
>> the -c 1 but it didn't make any difference.
>>
>> Then I realized there was something strange:
>> when I do scontrol show job , I can spot the following output:
>>
>>NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>TRES=cpu=2,node=1,billing=2
>>Socks/Node=* NtasksPerN:B:S:C=0:0:*:2 CoreSpec=*
>>
>> that is each job uses NumCPUs=2 instead of 1. Also, I'm not sure why
>> TRES=cpu=2
>>
>> Any idea on how to solve this problem and have 100% of the logical cores
>> allocated?
>>
>> Best regards,
>> David
>>
>

-- 
David Bellot
Head of Quantitative Research

A. Suite B, Level 3A, 43-45 East Esplanade, Manly, NSW 2095
E. david.bel...@lifetrading.com.au
P. (+61) 0405 263012


Re: [slurm-users] Controlling access to idle nodes

2020-10-08 Thread David Baker
Thank you very much for your comments. Oddly enough, I came up with the 
3-partition model as well once I'd sent my email. So, your comments helped to 
confirm that I was thinking on the right lines.

Best regards,
David


From: slurm-users  on behalf of Thomas 
M. Payerle 
Sent: 06 October 2020 18:50
To: Slurm User Community List 
Subject: Re: [slurm-users] Controlling access to idle nodes

We use a scavenger partition, and although we do not have the policy you 
describe, it could be used in your case.

Assume you have 6 nodes (node-[0-5]) and two groups A and B.
Create partitions
partA = node-[0-2]
partB = node-[3-5]
all = node-[0-6]

Create QoSes normal and scavenger.
Allow normal QoS to preempt jobs with scavenger QoS

In sacctmgr, give members of group A access to use partA with normal QoS  and 
group B access to use partB with normal QoS
Allow both A and B to use part all with scavenger QoS.

So members of A can launch jobs on partA with normal QoS (probably want to make 
that their default), and similarly member of B can launch jobs on partB with 
normal QoS.
But membes of A can also launch jobs on partB with scavenger QoS and vica 
versa.  If the partB nodes used by A are needed by B, they will get preempted.

This is not automatic (users need to explicitly say they want to run jobs on 
the other half of the cluster), but that is probably reasonable because there 
are some jobs one does not wish to get preempted even if they have to wait a 
while in the queue to ensure such.
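
In configuration terms the preemption side of this is roughly (a sketch only; 
QOS names and the preempt mode are to taste):

PreemptType=preempt/qos
PreemptMode=REQUEUE

sacctmgr add qos scavenger
sacctmgr modify qos scavenger set Priority=0
sacctmgr modify qos normal set Preempt=scavenger

with the "all" partition then restricted via AllowQos=scavenger so that jobs 
spilling onto the other group's nodes can only run at scavenger priority.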

On Tue, Oct 6, 2020 at 11:12 AM David Baker <d.j.ba...@soton.ac.uk> wrote:
Hello,

I would appreciate your advice on how to deal with this situation in Slurm, 
please. If I have a set of nodes used by 2 groups, and normally each group 
would each have access to half the nodes. So, I could limit each group to have 
access to 3 nodes each, for example. I am trying to devise a scheme that allows 
each group to make best use of the node always. In other words, each group 
could potentially use all the nodes (assuming they all free and the other group 
isn't using the nodes at all).

I cannot set hard and soft limits in slurm, and so I'm not sure how to make the 
situation flexible. Ideally It would be good for each group to be able to use 
their allocation and then take advantage of any idle nodes via a scavenging 
mechanism. The other group could then pre-empt the scavenger jobs and claim 
their nodes. I'm struggling with this since this seems like a two-way scavenger 
situation.

Could anyone please help? I have, by the way, set up partition-based 
pre-emption in the cluster. This allows the general public to scavenge nodes 
owned by research groups.

Best regards,
David




--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads       paye...@umd.edu
5825 University Research Park   (301) 405-6135
University of Maryland
College Park, MD 20740-3831


Re: [slurm-users] unable to run on all the logical cores

2020-10-11 Thread David Bellot
Indeed, it makes sense now. However, if I launch many R processes using the
"parallel" package, I can easily have all the "logical" cores running. In
the background, if I'm correct, R will "fork" and not create a thread, so 
we have independent processes. On a 20-core CPU, for example, I have 40 
"logical" cores and all the cores are running, according to htop.

With Slurm, I can't reproduce the same behavior even if I use the
SelectTypeParameters=CR_CPU.

So, is there a config to tune, an option to use in "sbatch" to achieve the
same result, or should I rather launch 20 jobs per node and have each job
split in two internally (using "parallel" or "future" for example)?
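
One way to separate the Slurm side from the batchtools side is to flood a node 
with trivial single-CPU jobs and count how many actually run on it at once; a 
throwaway sketch (node name and counts are just examples):

for i in $(seq 1 80); do sbatch -n1 -c1 -w foobar01 --wrap="sleep 300"; done
squeue -h -w foobar01 -t R | wc -l

If that reaches 40 running jobs on a 40-thread node, the scheduler is happy to 
use every logical core and the remaining limit is on the submission side.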

On Thu, Oct 8, 2020 at 6:32 PM William Brown 
wrote:

> R is single threaded.
>
> On Thu, 8 Oct 2020, 07:44 Diego Zuccato,  wrote:
>
>> On 08/10/20 08:19, David Bellot wrote:
>>
>> > good spot. At least, scontrol show job is now saying that each job only
>> > requires one "CPU", so it seems all the cores are treated the same way
>> now.
>> > Though I still have the problem of not using more than half the cores.
>> > So I suppose it might be due to the way I submit (batchtools in this
>> > case) the jobs.
>> Maybe R is generating single-threaded code? In that case, only a single
>> process can run on a given core at a time (processes does not share
>> memory map, threads do, and on Intel CPUs there's a single MMU per core,
>> not one per thread as in some AMDs).
>>
>> --
>> Diego Zuccato
>> DIFA - Dip. di Fisica e Astronomia
>> Servizi Informatici
>> Alma Mater Studiorum - Università di Bologna
>> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
>> tel.: +39 051 20 95786
>>
>>

-- 
David Bellot
Head of Quantitative Research

A. Suite B, Level 3A, 43-45 East Esplanade, Manly, NSW 2095
E. david.bel...@lifetrading.com.au
P. (+61) 0405 263012


[slurm-users] ninja and cmake

2020-11-24 Thread David Bellot
Hi,

I installed a cluster with 10 nodes and I'd like to try compiling a very
large code base using all the nodes. The context is as follows:
- my code base is in C++, I use gcc.
- configuration is done with CMake
- compilation is processed by ninja (something similar to make)

I can srun ninja and get the code base compiled on another node using as
many cores as I want on the other node.

Now what I want to do is to have each file being compiled as a single Slurm
job, so that I can spread my compilation over all the nodes of the cluster
and not just on one machine.

I know that ccache and distcc exist and I use them, but here I want to test
if it's possible to do it with Slurm (as a proof of concept).
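
One way I am thinking of trying (a sketch only; the srun options will need 
tuning) is to point CMake's compiler launcher at a small wrapper, so that every 
compile ninja spawns becomes its own one-task Slurm job:

cat > srun-cc.sh <<'EOF'
#!/bin/bash
# run a single compiler invocation under Slurm
exec srun --ntasks=1 --cpus-per-task=1 --quiet "$@"
EOF
chmod +x srun-cc.sh

cmake -G Ninja -DCMAKE_CXX_COMPILER_LAUNCHER=$PWD/srun-cc.sh ..
ninja -j 200   # ninja's -j now controls how many sruns are in flight at once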

Cheers,
David


[slurm-users] Backfill pushing jobs back

2020-12-09 Thread David Baker
Hello,


We see the following issue with smaller jobs pushing back large jobs. We are 
using slurm 19.05.8 so not sure if this is patched in newer releases. With a 4 
node test partition I submit 3 jobs as 2 users



ssh hpcdev1@navy51 'sbatch --nodes=3 --ntasks-per-node=40 
--partition=backfilltest --time=120 --wrap="sleep 7200"'

ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 
--partition=backfilltest --time=60 --wrap="sleep 3600"'

ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 
--partition=backfilltest --time=60 --wrap="sleep 3600"'



Then I increase the priority of the pending jobs significantly. Reading the 
manual, my understanding is that nodes should then be held for these jobs.

for job in $(squeue -h -p backfilltest -t pd -o %i); do scontrol update job 
${job} priority=10;done



squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"

JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE

28482 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING

28483 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING

28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING



So, there is one node free in our 4 node partition. Naturally, a small job with 
a walltime of less than 1 hour could run in that but we are also seeing 
backfill start longer jobs.



backfilltest    up 2-12:00:00      3  alloc reddev[001-003]

backfilltest    up 2-12:00:00      1   idle reddev004





ssh hpcdev3@navy51 'sbatch --nodes=1 --ntasks-per-node=40 
--partition=backfilltest --time=720 --wrap="sleep 432000"'





squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"

JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE

28482 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING

28483 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING

28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING

28484 | hpcdev3 | 40 | 37541 | 12:00:00 | 2020-12-08T09:54:48 | RUNNING



Is this expected behaviour? It is also weird that the pending jobs don't have a 
start time. I have increased the backfill parameters significantly, but it 
doesn't seem to affect this at all.



SchedulerParameters=bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60
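
For what it's worth, the scheduler's own estimate for the held jobs can be 
checked while this is happening with something like:

squeue --start -p backfilltest
scontrol show job 28482 | grep -E 'JobState|Reason|StartTime'

and in our case the pending jobs never acquire an expected start time, even 
with the larger bf_window.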


Best regards,

David



Re: [slurm-users] Backfill pushing jobs back

2020-12-10 Thread David Baker
Hi Chris,

Thank you for your reply. It isn't long since we upgraded to Slurm v19, however 
it sounds like we should start to actively look at v20 since this issue is 
causing significant problems on our cluster. We'll download and install v20 on 
our dev cluster and experiment.

Best regards,
David

From: slurm-users  on behalf of Chris 
Samuel 
Sent: 09 December 2020 16:37
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Backfill pushing jobs back

CAUTION: This e-mail originated outside the University of Southampton.

Hi David,

On 9/12/20 3:35 am, David Baker wrote:

> We see the following issue with smaller jobs pushing back large jobs. We
> are using slurm 19.05.8 so not sure if this is patched in newer releases.

This sounds like a problem that we had at NERSC (small jobs pushing back
multi-thousand node jobs), and we carried a local patch for which Doug
managed to get upstreamed in 20.02.x (I think it landed in 20.02.3, but
20.02.6 is the current version).

Hope this helps!
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Backfill pushing jobs back

2020-12-21 Thread David Baker
Hello,

Could I please follow up on the Slurm patch that relates to smaller jobs 
pushing large jobs back? My colleague downloaded and installed the most recent 
production version of Slurm today and tells me that it did not appear to 
resolve the issue. Just to note, we are currently running v19.05.8 and finding 
that the backfill mechanism pushes large jobs back. In theory, should the 
latest Slurm help us in sorting that issue out? I understand that we're testing 
v20.11.2, however I should clarify that with my colleague tomorrow.

Does anyone have any comments, please? Is there any parameter that we need to 
set to activate the backfill patch, for example?

Best regards,
David


From: slurm-users  on behalf of Chris 
Samuel 
Sent: 09 December 2020 16:37
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Backfill pushing jobs back

CAUTION: This e-mail originated outside the University of Southampton.

Hi David,

On 9/12/20 3:35 am, David Baker wrote:

> We see the following issue with smaller jobs pushing back large jobs. We
> are using slurm 19.05.8 so not sure if this is patched in newer releases.

This sounds like a problem that we had at NERSC (small jobs pushing back
multi-thousand node jobs), and we carried a local patch for which Doug
managed to get upstreamed in 20.02.x (I think it landed in 20.02.3, but
20.02.6 is the current version).

Hope this helps!
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



[slurm-users] Backfill pushing jobs back

2021-01-04 Thread David Baker
Hello,

Last year I posted on this forum looking for some help on backfill in Slurm. We 
are currently using Slurm 19.05.8 and we find that backfilled (smaller) jobs 
tend to push back large jobs in our cluster. Chris Samuel replied to our post 
with the following response...

This sounds like a problem that we had at NERSC (small jobs pushing back 
multi-thousand node jobs), and we carried a local patch for which Doug
managed to get upstreamed in 20.02.x (I think it landed in 20.02.3, but 20.02.6 
is the current version).

We looked through the release notes and sure enough there is a reference to a 
job starvation patch, though I'm not sure that it is the relevant patch... (in 20.02.2)
>  -- Fix scheduling issue when there are not enough nodes available to run a job
>     resulting in possible job starvation.

We decided to download and install the latest production version, 20.11.2, of 
Slurm. One of my team members managed the installation and ran his backfill 
tests only to find that the above backfill issue was still present. Should we 
wind back to version 20.02.6 and install/test that instead? Could someone please 
advise us? It would seem odd that a recent version of Slurm would still have a 
backfill issue that starves larger jobs out. We're wondering if we have 
forgotten to configure something very fundamental, for example.

Best regards,
David
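
As a quick diagnostic, it is worth checking whether the backfill scheduler is actually reserving a start time for the large jobs, and what the current scheduler settings are (the jobid below is a placeholder):

    squeue -j <jobid> --start        # expected start time, if a backfill reservation exists
    sprio -j <jobid>                 # priority components for the pending job
    scontrol show config | grep -i SchedulerParameters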


[slurm-users] Validating SLURM sreport cluster utilization report

2021-01-22 Thread David Simpson
Hi,

We've been using the sreport cluster utilization report to report on Down time, 
and therefore to produce an uptime figure for the entire cluster, which we hope 
will be at or above 99% for every month of the year.

Most of the time the figure that comes back fits our perception of the 
day-to-day running of the cluster.

We don't log node UP/DOWN in any way (beyond what Slurm does) and rely on 
sreport as explained above.

The December figure we have is lower than 99%, and there are 438 Slurm nodes in 
the cluster. In December we only remember having problems with 3 nodes, so at 
the moment, off the top of our heads, we don't understand the reported Down time.

Is anyone else relying on sreport for this metric? If so, have you encountered 
this sort of situation?

regards
David


-
David Simpson - Senior Systems Engineer
ARCCA, Redwood Building,
King Edward VII Avenue,
Cardiff, CF10 3NB

David Simpson - peiriannydd uwch systemau
ARCCA, Adeilad Redwood,
King Edward VII Avenue,
Caerdydd, CF10 3NB

simpso...@cardiff.ac.uk
+44 29208 74657

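One way to cross-check the sreport figure is to list the underlying node down/drain events that sreport aggregates, assuming those event records have not been purged (the dates below are placeholders):

    sreport cluster utilization start=2020-12-01 end=2021-01-01 -t percent
    sacctmgr show event start=2020-12-01 end=2021-01-01 format=NodeName,Start,End,State,Reason%40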



Re: [slurm-users] Validating SLURM sreport cluster utilization report

2021-01-29 Thread David Simpson
Out of interest (for those who do record and/or report on uptime): if you 
aren't using the sreport cluster utilization report, what alternative method 
are you using instead?

If you are using the sreport cluster utilization report, have you encountered this?

thanks
David

-
David Simpson - Senior Systems Engineer
ARCCA, Redwood Building,
King Edward VII Avenue,
Cardiff, CF10 3NB

David Simpson - peiriannydd uwch systemau
ARCCA, Adeilad Redwood,
King Edward VII Avenue,
Caerdydd, CF10 3NB

simpso...@cardiff.ac.uk
+44 29208 74657


From: slurm-users  On Behalf Of David 
Simpson
Sent: 22 January 2021 16:34
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Validating SLURM sreport cluster utilization report

Hi,

We've been using the sreport cluster utilization report to report on Down time, 
and therefore to produce an uptime figure for the entire cluster, which we hope 
will be at or above 99% for every month of the year.

Most of the time the figure that comes back fits our perception of the 
day-to-day running of the cluster.

We don't log node UP/DOWN in any way (beyond what Slurm does) and rely on 
sreport as explained above.

The December figure we have is lower than 99%, and there are 438 Slurm nodes in 
the cluster. In December we only remember having problems with 3 nodes, so at 
the moment, off the top of our heads, we don't understand the reported Down time.

Is anyone else relying on sreport for this metric? If so, have you encountered 
this sort of situation?
regards
David

-
David Simpson - Senior Systems Engineer
ARCCA, Redwood Building,
King Edward VII Avenue,
Cardiff, CF10 3NB

David Simpson - peiriannydd uwch systemau
ARCCA, Adeilad Redwood,
King Edward VII Avenue,
Caerdydd, CF10 3NB

simpso...@cardiff.ac.uk
+44 29208 74657

COVID-19 Cardiff University is currently under remote work restrictions. Our 
staff are continuing normal work schedules, but responses may be slower than 
usual.  We appreciate your patience during this unprecedented time

COVID-19 Ar hyn o bryd mae Prifysgol Caerdydd o dan gyfyngiadau gweithio o 
bell.  Mae ein staff yn parhau ag amserlenni gwaith arferol, ond gall ymatebion 
fod yn arafach na'r arfer. Rydym yn gwerthfawrogi eich amynedd yn ystod yr 
amser digynsail hwn.
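For an sreport-independent record, one low-tech option is to snapshot node down/drain states from cron and keep the log; the path and schedule are arbitrary examples:

    #!/bin/bash
    # log-down-nodes.sh -- run periodically from cron; appends timestamped
    # records of nodes that are down/drained together with the reason.
    logfile=/var/log/slurm/node-down.log
    printf '== %s ==\n' "$(date -Is)" >> "$logfile"
    sinfo -h -R -o "%n|%T|%E" >> "$logfile"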



[slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?

2021-02-05 Thread Chin,David
Hi all:

I have a new cluster, and I am attempting to dump all the accounting data that 
I generated in the test period before our official opening.

Installation info:

  *   Bright Cluster Manager 9.0
  *   Slurm 20.02.6
  *   Red Hat 8.1

In slurmdbd.conf, I have:

ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
ArchiveSuspend=yes

On the commandline, I do:

$ sudo sacctmgr archive dump Directory=/data/Backups/Slurm 
PurgeEventAfter=1hours PurgeJobAfter=1hours PurgeStepAfter=1hours 
PurgeSuspendAfter=1hours
This may result in loss of accounting database records (if Purge* options 
enabled).
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: slurmdbd: SUCCESS

However, no dump file is produced. And if I run sreport, I still see data from 
last month. (I also tried "1hour", i.e. dropping the "s".)

Is there something I am missing?

Thanks,
    Dave Chin

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode



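As an aside, the archiving and purging can also be done automatically by slurmdbd rather than by ad-hoc sacctmgr calls; a slurmdbd.conf sketch with illustrative retention periods (the directory and intervals are assumptions, not recommendations):

    # slurmdbd.conf excerpt -- automatic archive/purge
    ArchiveDir=/data/Backups/Slurm
    ArchiveEvents=yes
    ArchiveJobs=yes
    ArchiveSteps=yes
    ArchiveSuspend=yes
    ArchiveUsage=yes
    PurgeEventAfter=12month
    PurgeJobAfter=12month
    PurgeStepAfter=6month
    PurgeSuspendAfter=6month
    PurgeUsageAfter=24month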


[slurm-users] Unsetting a QOS Flag?

2021-02-08 Thread Chin,David
Hello all:

I have a QOS defined which has the Flaq DenyOnLimit set:

$ sacctmgr show qos foo format=name,flags
  NameFlags
-- 
  foo   DenyOnLimit


How can I "unset" that Flag?

I tried "sacctmgr modify qos foo unset Flags=DenyOnLimit", and "sacctmgr modify 
qos foo set Flags=NoDenyOnLimit", to no avail.

Thanks in advance,
Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode



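For anyone hitting the same question: sacctmgr supports += and -= modifiers for adding and removing individual values from list-type fields, and the same syntax is commonly used for QOS flags -- worth verifying against the sacctmgr man page for your Slurm version before relying on it:

    # remove a single flag from a QOS, leaving any other flags in place
    # (assumes the +=/-= modifier syntax applies to QOS Flags on this version)
    sacctmgr modify qos foo set Flags-=DenyOnLimit
    sacctmgr show qos foo format=name,flags   # confirm the flag is gone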


Re: [slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?

2021-02-09 Thread Chin,David
Well, I seem to have figured it out. This worked and did what I wanted to (I 
think):

$ sudo sacctmgr archive dump Directory=/data/Backups/Slurm 
PurgeEventAfter=1hour \
  PurgeJobAfter=1hour PurgeStepAfter=1hour PurgeSuspendAfter=1hour \
  PurgeUsageAfter=1hour Events Jobs Steps Suspend Usage

This generated various usage dump files, and the job_table and step_table dumps.

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode


From: slurm-users  on behalf of 
Chin,David 
Sent: Friday, February 5, 2021 15:47
To: Slurm-Users List 
Subject: [slurm-users] sacctmgr archive dump - no dump file produced, and data 
not purged?



Hi all:

I have a new cluster, and I am attempting to dump all the accounting data that 
I generated in the test period before our official opening.

Installation info:

  *   Bright Cluster Manager 9.0
  *   Slurm 20.02.6
  *   Red Hat 8.1

In slurmdbd.conf, I have:

ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
ArchiveSuspend=yes

On the commandline, I do:

$ sudo sacctmgr archive dump Directory=/data/Backups/Slurm 
PurgeEventAfter=1hours PurgeJobAfter=1hours PurgeStepAfter=1hours 
PurgeSuspendAfter=1hours
This may result in loss of accounting database records (if Purge* options 
enabled).
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: slurmdbd: SUCCESS

However, no dump file is produced. And if I run sreport, I still see data from 
last month. (I also tried "1hour", i.e. dropping the "s".)

Is there something I am missing?

Thanks,
    Dave Chin

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode





[slurm-users] sreport cluster AccountUtilizationByUser showing utilization of a deleted account

2021-02-09 Thread Chin,David
Hello, all:

Details:

  *   slurm 20.02.6
  *   MariaDB 10.3.17
  *   RHEL 8.1

I have a fairshare setup. During testing, I went through a couple of iterations
of manually creating accounts and users that I later deleted before putting in
what is to be the production setup.

One of the deleted accounts is named "urcfadm" - in the slurm_acct_db → 
acct_table,
the row (?) for that account has a value 1 in the "deleted" column:

  creation_time  mod_time    deleted  name     description     organization
  1607378518     1611091499  1        urcfadm  urcf_sysadmins  research

I also purged all Events, Jobs, Steps, Suspend, Usage that are older than 1 
hour.

  sacctmgr archive dump Directory=/data/Backups/Slurm PurgeEventAfter=1hour \
PurgeJobAfter=1hour PurgeStepAfter=1hour PurgeSuspendAfter=1hour \
PurgeUsageAfter=1hour  Events Jobs Steps Suspend Usage

When I run

   sreport cluster AccountUtilizationByUser Start=2021-02-09 End=2021-02-10 -T 
billing

I get numbers which don't add up as one goes up to the root node of the tree, 
and I have a line for the account "urcfadm":


  Cluster  Account   Login  Proper Name  TRES Name    Used
--------- --------- ------ ------------ ---------- -------
  ...
  picotte  urcfadm                       billing     110708
  ...


In the report period, no jobs ran under the urcfadm account.

Is there a way to fix this without just purging all the data?

If there is no "graceful" fix, is there a way I can "reset" the slurm_acct_db,
i.e. actually purge all data in all tables?


Thanks in advance,
   Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode



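On the "reset" question: as far as I know there is no single sacctmgr command to wipe everything, but slurmdbd will recreate its tables in an empty database at startup. A destructive sketch, assuming the default database name and a MariaDB root login (this erases all accounting history, so keep the mysqldump):

    systemctl stop slurmdbd
    mysqldump -u root -p slurm_acct_db > slurm_acct_db.backup.sql     # safety copy
    mysql -u root -p -e "DROP DATABASE slurm_acct_db; CREATE DATABASE slurm_acct_db;"
    systemctl start slurmdbd          # recreates an empty schema in the new database
    sacctmgr add cluster picotte      # then re-add the cluster, accounts and users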


Re: [slurm-users] prolog not passing env var to job

2021-03-03 Thread Chin,David
ahmet.mer...@uhem.itu.edu.tr wrote:
> Prolog and TaskProlog are different parameters and scripts. You should
> use the TaskProlog script to set env. variables.

Can you tell me how to do this for srun? E.g. users request an interactive 
shell:

srun -n 1 -t 600 --pty /bin/bash

but the shell on the compute node does not have the env variables set.

I use the same prolog script as TaskProlog, which sets it properly for jobs 
submitted
with sbatch.

Thanks in advance,
Dave Chin

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode



From: slurm-users  on behalf of mercan 

Sent: Friday, February 12, 2021 16:27
To: Slurm User Community List ; Herc Silverstein 
; slurm-us...@schedmd.com 

Subject: Re: [slurm-users] prolog not passing env var to job


Hi;

Prolog and TaskProlog are different parameters and scripts. You should
use the TaskProlog script to set env. variables.

Regards;

Ahmet M.






Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
Hi, Brian:

So, this is my SrunProlog script -- I want a job-specific tmp dir, which makes 
for easy cleanup at end of job:

#!/bin/bash
if [[ -z ${SLURM_ARRAY_JOB_ID+x} ]]
then
export TMP="/local/scratch/${SLURM_JOB_ID}"
export TMPDIR="${TMP}"
export LOCAL_TMPDIR="${TMP}"
export BEEGFS_TMPDIR="/beegfs/scratch/${SLURM_JOB_ID}"
else
export TMP="/local/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"
export TMPDIR="${TMP}"
export LOCAL_TMPDIR="${TMP}"
export BEEGFS_TMPDIR="/beegfs/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"
fi

echo DEBUG srun_set_tmp.sh
echo I am `whoami`

/usr/bin/mkdir -p ${TMP}
chmod 700 ${TMP}
/usr/bin/mkdir -p ${BEEGFS_TMPDIR}
chmod 700 ${BEEGFS_TMPDIR}

And this is my srun session:

picotte001::~$ whoami
dwc62
picotte001::~$ srun -p def --mem 1000 -n 4 -t 600 --pty /bin/bash
DEBUG srun_set_tmp.sh
I am dwc62
node001::~$ echo $TMP
/local/scratch/80472
node001::~$ ll !$
ll $TMP
/bin/ls: cannot access '/local/scratch/80472': No such file or directory
node001::~$ mkdir $TMP
node001::~$ ll -d !$
ll -d $TMP
drwxrwxr-x 2 dwc62 dwc62 6 Mar  4 11:52 /local/scratch/80472/
node001::~$ exit

So, the "echo" and "whoami" statements are executed by the prolog script, as 
expected, but the mkdir commands are not?

Thanks,
Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode


From: slurm-users  on behalf of Brian 
Andrus 
Sent: Thursday, March 4, 2021 10:12
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] prolog not passing env var to job




It seems to me, if you are using srun directly to get an interactive shell, you 
can just run the script once you get your shell.


You can set the variables and then run srun. It automatically exports the 
environment.

If you want to change a particular one (or more), use something like 
--export=ALL,MYVAR=othervalue

do 'man srun' and look at the --export option


Brian Andrus




On 3/3/2021 9:28 PM, Chin,David wrote:
ahmet.mer...@uhem.itu.edu.tr wrote:
> Prolog and TaskProlog are different parameters and scripts. You should
> use the TaskProlog script to set env. variables.

Can you tell me how to do this for srun? E.g. users request an interactive 
shell:

srun -n 1 -t 600 --pty /bin/bash

but the shell on the compute node does not have the env variables set.

I use the same prolog script as TaskProlog, which sets it properly for jobs 
submitted
with sbatch.

Thanks in advance,
Dave Chin

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode



From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of mercan <ahmet.mer...@uhem.itu.edu.tr>
Sent: Friday, February 12, 2021 16:27
To: Slurm User Community List <slurm-users@lists.schedmd.com>; Herc Silverstein <herc.silverst...@schrodinger.com>; slurm-us...@schedmd.com
Subject: Re: [slurm-users] prolog not passing env var to job


Hi;

Prolog and TaskProlog are different parameters and scripts. You should
use the TaskProlog script to set env. variables.

Regards;

Ahmet M.






Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
Hi Brian:

This works just as I expect for sbatch.

The example srun execution I showed was a non-array job, so the first half of 
the "if []" statement holds. It is the second half, which deals with job 
arrays, which has the period.

The value of TMP is correct, i.e. "/local/scratch/80472"

And the command, in the prolog script is correct, i.e. "/usr/bin/mkdir -p 
${TMP}"

If I type that command during the interactive job, it does what I expect, i.e. 
creates the directory $TMP = /local/scratch/80472

Regards,
Dave

From: slurm-users  on behalf of Brian 
Andrus 
Sent: Thursday, March 4, 2021 13:48
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] prolog not passing env var to job



I think it isn't running how you think or there is something not provided in 
the description.


You have:

export TMP="/local/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"

Notice that period in there.
Then you have:
node001::~$ echo $TMP
/local/scratch/80472
There is no period.
In fact, SLURM_ARRAY_JOB_ID should be blank too if you are not running as an 
array session.

However, to your desire for a job-specific tmp directory:
Check out the mktemp command. It should do just what you want. I use it for 
interactive desktop sessions to create the temp directory used for users' X 
sessions.
You just need to make sure the user has write access to the parent directory 
you are creating it in (chmod 1777 on the parent directory is good).

Brian Andrus

On 3/4/2021 9:03 AM, Chin,David wrote:
Hi, Brian:

So, this is my SrunProlog script -- I want a job-specific tmp dir, which makes 
for easy cleanup at end of job:

#!/bin/bash
if [[ -z ${SLURM_ARRAY_JOB_ID+x} ]]
then
export TMP="/local/scratch/${SLURM_JOB_ID}"
export TMPDIR="${TMP}"
export LOCAL_TMPDIR="${TMP}"
export BEEGFS_TMPDIR="/beegfs/scratch/${SLURM_JOB_ID}"
else
export TMP="/local/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"
export TMPDIR="${TMP}"
export LOCAL_TMPDIR="${TMP}"
export BEEGFS_TMPDIR="/beegfs/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"
fi

echo DEBUG srun_set_tmp.sh
echo I am `whoami`

/usr/bin/mkdir -p ${TMP}
chmod 700 ${TMP}
/usr/bin/mkdir -p ${BEEGFS_TMPDIR}
chmod 700 ${BEEGFS_TMPDIR}

And this is my srun session:

picotte001::~$ whoami
dwc62
picotte001::~$ srun -p def --mem 1000 -n 4 -t 600 --pty /bin/bash
DEBUG srun_set_tmp.sh
I am dwc62
node001::~$ echo $TMP
/local/scratch/80472
node001::~$ ll !$
ll $TMP
/bin/ls: cannot access '/local/scratch/80472': No such file or directory
node001::~$ mkdir $TMP
node001::~$ ll -d !$
ll -d $TMP
drwxrwxr-x 2 dwc62 dwc62 6 Mar  4 11:52 /local/scratch/80472/
node001::~$ exit

So, the "echo" and "whoami" statements are executed by the prolog script, as 
expected, but the mkdir commands are not?

Thanks,
Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode


From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Brian Andrus <toomuc...@gmail.com>
Sent: Thursday, March 4, 2021 10:12
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] prolog not passing env var to job




It seems to me, if you are using srun directly to get an interactive shell, you 
can just run the script once you get your shell.


You can set the variables and then run srun. It automatically exports the 
environment.

If you want to change a particular one (or more), use something like 
--export=ALL,MYVAR=othervalue

do 'man srun' and look at the --export option


Brian Andrus




On 3/3/2021 9:28 PM, Chin,David wrote:
ahmet.mer...@uhem.itu.edu.tr wrote:
> Prolog and TaskProlog are different parameters and scripts. You should
> use the TaskProlog script to set env. variables.

Can you tell me how to do this for srun? E.g. users request an interactive 
shell:

srun -n 1 -t 600 --pty /bin/bash

but the shell on the compute node does not have the env variables set.

I use the same prolog script as TaskProlog, which sets it properly for jobs 
submitted
with sbatch.

Thanks in advance,
Dave Chin

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
My mistake - from slurm.conf(5):

SrunProlog runs on the node where the "srun" is executing.

i.e. the login node, which explains why the directory is not being created on 
the compute node, while the echoes work.

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode
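
A split that works for both sbatch and srun jobs is to let the (root-run, compute-node) Prolog create the per-job scratch directory and let TaskProlog export the variables: slurmstepd parses TaskProlog's stdout and applies lines of the form "export NAME=value" to the task environment. A sketch, assuming the same /local/scratch layout as above (script paths are examples):

    # /etc/slurm/prolog.sh  (Prolog= in slurm.conf; runs as root on the compute node)
    #!/bin/bash
    TMP="/local/scratch/${SLURM_JOB_ID}"
    mkdir -p "${TMP}"
    chown "${SLURM_JOB_USER}" "${TMP}"
    chmod 700 "${TMP}"

    # /etc/slurm/task_prolog.sh  (TaskProlog= in slurm.conf; stdout parsed by slurmstepd)
    #!/bin/bash
    echo "export TMP=/local/scratch/${SLURM_JOB_ID}"
    echo "export TMPDIR=/local/scratch/${SLURM_JOB_ID}"

    # a matching Epilog can remove the directory when the job finishes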




[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Hi, all:

I'm trying to understand why a job exited with an error condition. I think it 
was actually terminated by Slurm: job was a Matlab script, and its output was 
incomplete.

Here's sacct output:

       JobID    JobName  User Partition NodeList  Elapsed      State ExitCode ReqMem   MaxRSS MaxVMSize                AllocTRES AllocGRE
------------ ---------- ----- --------- -------- -------- ---------- -------- ------ -------- --------- ------------------------ --------
       83387 ProdEmisI+  foob       def  node001 03:34:26 OUT_OF_ME+    0:125  128Gn                    billing=16,cpu=16,node=1
 83387.batch      batch                  node001 03:34:26 OUT_OF_ME+    0:125  128Gn 1617705K  7880672K      cpu=16,mem=0,node=1
83387.extern     extern                  node001 03:34:26  COMPLETED      0:0  128Gn     460K   153196K billing=16,cpu=16,node=1

Thanks in advance,
    Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode



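When the accounted memory looks far too low to explain an OUT_OF_MEMORY state, the kernel's own counters can be checked while the job is still running; a sketch assuming cgroup v1 and Slurm's usual cgroup hierarchy (paths vary by configuration, and the job's cgroup is removed once the job ends):

    # on the compute node, while job 83387 is still running (adjust uid/jobid)
    cat /sys/fs/cgroup/memory/slurm/uid_*/job_83387/memory.max_usage_in_bytes
    grep oom_kill /sys/fs/cgroup/memory/slurm/uid_*/job_83387/memory.oom_control
    # after the fact, the kernel log records any OOM kills:
    dmesg -T | grep -iE 'out of memory|oom-killer'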


Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Here's seff output, if it makes any difference. In any case, the exact same job 
was run by the user on their laptop with 16 GB RAM with no problem.

Job ID: 83387
Cluster: picotte
User/Group: foob/foob
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 06:50:30
CPU Efficiency: 11.96% of 2-09:10:56 core-walltime
Job Wall-clock time: 03:34:26
Memory Utilized: 1.54 GB
Memory Efficiency: 1.21% of 128.00 GB


--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode


From: slurm-users  on behalf of Paul 
Edmon 
Sent: Monday, March 15, 2021 14:02
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and 
MaxVMSize are under the ReqMem value



One should keep in mind that sacct results for memory usage are not accurate 
for Out Of Memory (OoM) jobs. This is because the job is typically terminated 
before the next sacct polling period, and also before it reaches its full 
memory allocation. Thus I wouldn't trust any of the memory-usage results if the 
job is terminated by OoM. sacct just can't pick up a sudden memory spike like 
that, and even if it did, it would not correctly record the peak memory because 
the job was terminated before that point.


-Paul Edmon-


On 3/15/2021 1:52 PM, Chin,David wrote:
Hi, all:

I'm trying to understand why a job exited with an error condition. I think it 
was actually terminated by Slurm: job was a Matlab script, and its output was 
incomplete.

Here's sacct output:

       JobID    JobName  User Partition NodeList  Elapsed      State ExitCode ReqMem   MaxRSS MaxVMSize                AllocTRES AllocGRE
------------ ---------- ----- --------- -------- -------- ---------- -------- ------ -------- --------- ------------------------ --------
       83387 ProdEmisI+  foob       def  node001 03:34:26 OUT_OF_ME+    0:125  128Gn                    billing=16,cpu=16,node=1
 83387.batch      batch                  node001 03:34:26 OUT_OF_ME+    0:125  128Gn 1617705K  7880672K      cpu=16,mem=0,node=1
83387.extern     extern                  node001 03:34:26  COMPLETED      0:0  128Gn     460K   153196K billing=16,cpu=16,node=1

Thanks in advance,
Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode





Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Hi Michael:

I looked at the Matlab script: it's loading an xlsx file which is 2.9 kB.

There are some "static" arrays allocated with ones() or zeros(), but those use 
small subsets (< 10 columns) of the loaded data, and outputs are arrays of 
6x10. Certainly there are not 16e9 rows in the original file.

Saved output .mat file is only 1.8kB.

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode



From: slurm-users  on behalf of Renfro, 
Michael 
Sent: Monday, March 15, 2021 14:04
To: Slurm User Community List 
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and 
MaxVMSize are under the ReqMem value



Just a starting guess, but are you certain the MATLAB script didn’t try to 
allocate enormous amounts of memory for variables? That’d be about 16e9 
floating point values, if I did the units correctly.






Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
One possible datapoint: on the node where the job ran, there were two 
slurmstepd processes running, both at 100% CPU even after the job had ended.


--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode


From: slurm-users  on behalf of 
Chin,David 
Sent: Monday, March 15, 2021 13:52
To: Slurm-Users List 
Subject: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and 
MaxVMSize are under the ReqMem value



Hi, all:

I'm trying to understand why a job exited with an error condition. I think it 
was actually terminated by Slurm: job was a Matlab script, and its output was 
incomplete.

Here's sacct output:

       JobID    JobName  User Partition NodeList  Elapsed      State ExitCode ReqMem   MaxRSS MaxVMSize                AllocTRES AllocGRE
------------ ---------- ----- --------- -------- -------- ---------- -------- ------ -------- --------- ------------------------ --------
       83387 ProdEmisI+  foob       def  node001 03:34:26 OUT_OF_ME+    0:125  128Gn                    billing=16,cpu=16,node=1
 83387.batch      batch                  node001 03:34:26 OUT_OF_ME+    0:125  128Gn 1617705K  7880672K      cpu=16,mem=0,node=1
83387.extern     extern                  node001 03:34:26  COMPLETED      0:0  128Gn     460K   153196K billing=16,cpu=16,node=1

Thanks in advance,
    Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode


