Re: [slurm-users] Weirdness with partitions
I would think that slurm would only filter it out, potentially, if the partition in question (b4) was marked as "hidden" and only accessible by the correct account. On Thu, Sep 21, 2023 at 3:11 AM Diego Zuccato wrote: > Hello all. > > We have one partition (b4) that's reserved for an account while the > others are "free for all". > The problem is that > sbatch --partition=b1,b2,b3,b4,b5 test.sh > fails with > sbatch: error: Batch job submission failed: Invalid account or > account/partition combination specified > while > sbatch --partition=b1,b2,b3,b5 test.sh > succeeds. > > Shouldn't Slurm (22.05.6) just "filter out" the inaccessible partition, > considering only the others? > Just like what it does if I'm requesting more cores than available on a > node. > > I'd really like to avoid having to replicate scheduler logic in > job_submit.lua... :) > > -- > Diego Zuccato > DIFA - Dip. di Fisica e Astronomia > Servizi Informatici > Alma Mater Studiorum - Università di Bologna > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy > tel.: +39 051 20 95786 > > -- David Rhey --- Advanced Research Computing University of Michigan
Re: [slurm-users] Weirdness with partitions
Slurm is working as it should. From your own examples you proved that; by not submitting to b4 the job works. However, looking at man sbatch: -p, --partition= Request a specific partition for the resource allocation. If not specified, the default behavior is to allow the slurm controller to select the default partition as designated by the system administrator. If the job can use more than one partition, specify their names in a comma separate list and the one offering earliest initiation will be used with no regard given to the partition name ordering (although higher pri‐ ority partitions will be considered first). When the job is initiated, the name of the partition used will be placed first in the job record partition string. In your example, the job can NOT use more than one partition (given the restrictions defined on the partition itself precluding certain accounts from using it). This, to me, seems either like a user education issue (i.e. don't have them submit to every partition), or you can try the job submit lua route - or perhaps the hidden partition route (which I've not tested). On Thu, Sep 21, 2023 at 9:18 AM Diego Zuccato wrote: > Uh? It's not a problem if other users see there are jobs in the > partition (IIUC it's what 'hidden' is for), even if they can't use it. > > The problem is that if it's included in --partition it prevents jobs > from being queued! > Nothing in the documentation about --partition made me think that > forbidding access to one partition would make a job unqueueable... > > Diego > > Il 21/09/2023 14:41, David ha scritto: > > I would think that slurm would only filter it out, potentially, if the > > partition in question (b4) was marked as "hidden" and only accessible by > > the correct account. > > > > On Thu, Sep 21, 2023 at 3:11 AM Diego Zuccato > <mailto:diego.zucc...@unibo.it>> wrote: > > > > Hello all. > > > > We have one partition (b4) that's reserved for an account while the > > others are "free for all". > > The problem is that > > sbatch --partition=b1,b2,b3,b4,b5 test.sh > > fails with > > sbatch: error: Batch job submission failed: Invalid account or > > account/partition combination specified > > while > > sbatch --partition=b1,b2,b3,b5 test.sh > > succeeds. > > > > Shouldn't Slurm (22.05.6) just "filter out" the inaccessible > partition, > > considering only the others? > > Just like what it does if I'm requesting more cores than available > on a > > node. > > > > I'd really like to avoid having to replicate scheduler logic in > > job_submit.lua... :) > > > > -- > > Diego Zuccato > > DIFA - Dip. di Fisica e Astronomia > > Servizi Informatici > > Alma Mater Studiorum - Università di Bologna > > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy > > tel.: +39 051 20 95786 > > > > > > > > -- > > David Rhey > > --- > > Advanced Research Computing > > University of Michigan > > -- > Diego Zuccato > DIFA - Dip. di Fisica e Astronomia > Servizi Informatici > Alma Mater Studiorum - Università di Bologna > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy > tel.: +39 051 20 95786 > > -- David Rhey --- Advanced Research Computing University of Michigan
Re: [slurm-users] Weirdness with partitions
That's not at all how I interpreted this man page description. By "If the job can use more than..." I thought it was completely obvious (although perhaps wrong, if your interpretation is correct, but it never crossed my mind) that it referred to whether the _submitting user_ is OK with it using more than one partition. The partition where the user is forbidden (because of the partition's allowed account) should just be _not_ the earliest initiation (because it'll never initiate there), and therefore not run there, but still be able to run on the other partitions listed in the batch script. > that's fair. I was considering this only given the fact that we know the user doesn't have access to a partition (this isn't the surprise here) and that slurm communicates that as the reason pretty clearly. I can see how if a user is submitting against multiple partitions they might hope that if a job couldn't run in a given partition, given the number of others provided, the scheduler might consider all of those *before* dying outright at the first rejection. On Thu, Sep 21, 2023 at 10:28 AM Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) wrote: > On Sep 21, 2023, at 9:46 AM, David wrote: > > Slurm is working as it should. From your own examples you proved that; by > not submitting to b4 the job works. However, looking at man sbatch: > >-p, --partition= > Request a specific partition for the resource allocation. > If not specified, the default behavior is to allow the slurm controller to > select > the default partition as designated by the system > administrator. If the job can use more than one partition, specify their > names in a comma > separate list and the one offering earliest initiation will > be used with no regard given to the partition name ordering (although > higher pri‐ > ority partitions will be considered first). When the job is > initiated, the name of the partition used will be placed first in the job > record > partition string. > > In your example, the job can NOT use more than one partition (given the > restrictions defined on the partition itself precluding certain accounts > from using it). This, to me, seems either like a user education issue (i.e. > don't have them submit to every partition), or you can try the job submit > lua route - or perhaps the hidden partition route (which I've not tested). > > > That's not at all how I interpreted this man page description. By "If the > job can use more than..." I thought it was completely obvious (although > perhaps wrong, if your interpretation is correct, but it never crossed my > mind) that it referred to whether the _submitting user_ is OK with it using > more than one partition. The partition where the user is forbidden (because > of the partition's allowed account) should just be _not_ the earliest > initiation (because it'll never initiate there), and therefore not run > there, but still be able to run on the other partitions listed in the batch > script. > > I think it's completely counter-intuitive that submitting saying it's OK > to run on one of a few partitions, and one partition happening to be > forbidden to the submitting user, means that it won't run at all. What if > you list multiple partitions, and increase the number of nodes so that > there aren't enough in one of the partitions, but not realize this > problem? Would you expect that to prevent the job from ever running on any > partition? > > Noam > -- David Rhey --- Advanced Research Computing University of Michigan
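A site that wants the "filter it out" behaviour discussed in this thread, without teaching every user to trim their partition lists, can do the filtering at submit time. Below is a minimal wrapper sketch, not a tested solution: the partition name "b4", the account name "relgroup" and the script name are assumptions taken from the thread, and a job_submit.lua filter on the controller would achieve the same thing without a wrapper.

#!/bin/bash
# submit-filtered.sh -- drop the account-restricted partition from the
# requested list, then hand everything else to sbatch unchanged.
RESTRICTED="b4"             # assumed: the partition tied to one account
ALLOWED_ACCOUNT="relgroup"  # assumed: the account allowed to use it
PARTS="b1,b2,b3,b4,b5"

# first account associated with the submitting user
MY_ACCOUNT=$(sacctmgr -nP show assoc where user="$USER" format=account | head -n1)

if [ "$MY_ACCOUNT" != "$ALLOWED_ACCOUNT" ]; then
    # remove the restricted partition from the comma-separated list
    PARTS=$(echo "$PARTS" | tr ',' '\n' | grep -vx "$RESTRICTED" | paste -sd, -)
fi

exec sbatch --partition="$PARTS" "$@"

Used as "./submit-filtered.sh test.sh", this submits with --partition=b1,b2,b3,b5 for anyone outside the relgroup account, which is the behaviour the original poster expected from Slurm itself.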
Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?
A colleague of mine has it scripted out quite well, so I can't speak to *all* of the details. However, we have a user that we submit our jobs as and it does the steps for upgrading (yum, dnf, etc). The jobs are wholenode/exclusive so nothing else can run there, and then a few other steps might be taken (node reboots etc). I think we might have some level of reservation in there so nodes can drain (which would help expedite the situation a bit but it still would depend on your longest running job). This has worked well for . releases/patches and effectively behaves like a rolling upgrade. Yours might even be easier/quicker since it's symlinks (which is SchedMD's preferred method, iirc). Speaking of which, I believe one of the SchedMD folks gave some pointers on that in the past, perhaps in a presentation at SLUG. So you could peruse there, as well. On Thu, Sep 28, 2023 at 12:04 PM Groner, Rob wrote: > > There's 14 steps to upgrading slurm listed on their website, including > shutting down and backing up the database. So far we've only updated slurm > during a downtime, and it's been a major version change, so we've taken all > the steps indicated. > > We now want to upgrade from 23.02.4 to 23.02.5. > > Our slurm builds end up in version named directories, and we tell > production which one to use via symlink. Changing the symlink will > automatically change it on our slurm controller node and all slurmd nodes. > > Is there an expedited, simple, slimmed down upgrade path to follow if > we're looking at just a . level upgrade? > > Rob > > -- David Rhey --- Advanced Research Computing University of Michigan
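For illustration, a hedged sketch of the "upgrade as a job" pattern described above. It assumes a package-based install, an account with sudo rights on the compute nodes, and that restarting slurmd is all a patch-level change needs; sites using symlinked installs would swap in their own steps (flip the symlink, restart slurmd).

#!/bin/bash
# Submit one exclusive upgrade job per node; each runs once that node drains
# of other work, giving a rolling upgrade.
for node in $(sinfo -hN -o "%N" | sort -u); do
    sbatch --exclusive --nodelist="$node" --job-name="upgrade-$node" \
           --wrap "sudo dnf -y upgrade 'slurm*' && sudo systemctl restart slurmd"
done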
Re: [slurm-users] TRES sreport per association
Hello, Perhaps `scontrol show assoc` might be able to help you here, in part? Or even sshare. Those would be the raw usage numbers, if I remember correctly. But it might help get you some insight as to usage (though not analogous to what sreport would show). As a note: `scontrol show assoc` will be very lengthy output. HTH, David On Sun, Nov 12, 2023 at 6:03 PM Kamil Wilczek wrote: > Dear All, > > is it possible to report GPU Minutes per association? Suppose > I have two associations like this: > > sacctmgr show assoc where user=$(whoami) > format=account%10,user%16,partition%12,qos%12,grptresmins%20 > >
> Account   User    Partition   QOS      GrpTRESMins
> --------- ------- ----------- -------- -------------
> staff     kmwil   gpu_adv     1gpu1d   gres/gpu=1
> staff     kmwil   common      4gpu4d   gres/gpu=100
> > When I run "sreport" I get (I think) the cumulative report. There > is no "association" option for the "--format" flag for "sreport". > > In my setup I divide the cluster using GPU generations. Older > cards, like TITAN V are accessible for all users (a common > partition), but, for example, a partition with nodes with A100 > is accessible only for selected users. > > Each user gets a QoS ("4gpu4d" means that a user can allocate > 4 GPUs at most and a single job time limit is 4 days). Each > user is also limited to a number of GPUMinutes for each > association and it would be nice to know how many minutes > are left per assoc. > > Kind regards > -- > Kamil Wilczek [https://keys.openpgp.org/] > [6C4BE20A90A1DBFB3CBE2947A832BF5A491F9F2A] > -- David Rhey --- Advanced Research Computing University of Michigan
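For reference, hedged examples of the commands suggested above, using the account/user/partition names from the question; field names can differ slightly between Slurm versions, so treat these as sketches rather than exact recipes.

# per-association limits and raw usage (partition-based associations appear as separate rows)
sshare -l -A staff -u kmwil -o Account,User,Partition,GrpTRESMins,GrpTRESRaw,RawUsage

# the live limit/usage view held by slurmctld (very lengthy cluster-wide)
scontrol -o show assoc_mgr users=kmwil accounts=staff flags=assoc

# cluster-wide GPU minutes per account/user, though not split by association
sreport -t Minutes --tres=gres/gpu cluster AccountUtilizationByUser start=2023-11-01 end=2023-12-01 users=kmwil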
[slurm-users] "command not found"
Hi, when running an sbatch script I get "command not found". The command is blast (a widely used bioinformatics tool). The problem comes from the fact that the blast binary is installed on the master node but not on the other nodes. When the job runs on another node the binary is not found. What would be the way to deal with this situation? What is common practice? thanks, david
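The usual fixes are either to install the tool somewhere every compute node can see (a shared filesystem) or to expose it through environment modules, and then reference it from the batch script. A hedged sketch; the paths, module name and input files below are placeholders, not a known site layout.

#!/bin/bash
#SBATCH --job-name=blast
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Option 1: environment modules, if the site provides them on all nodes
# module load blast

# Option 2: a copy installed on shared storage visible to every node
export PATH=/shared/apps/ncbi-blast/bin:$PATH

blastn -query input.fa -db nt -out result.txt -num_threads "$SLURM_CPUS_PER_TASK"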
[slurm-users] External provisioning for accounts and other things (?)
Hello, All, First time caller, long-time listener. Does anyone use any sort of external tool (e.g. a form submission) that generates accounts for their Slurm environment (notably for new accounts/allocations)? An example of this would be: a group or user needs us to provision resources for them to run on and so they submit a form to us with information on their needs and we provision for them. If anyone is using external utilities, are they manually putting that in or are they leveraging Slurm APIs to do this? It's a long shot, but if anyone is doing this with ServiceNow, I'd be extra interested in how you achieved that. Thanks! -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
Re: [slurm-users] External provisioning for accounts and other things (?)
Thanks!! We're currently in a similar boat where things are provisioned in a form, and we receive an email and act on that information with scripts and some text expansion. We were wondering whether or not some tighter integration was possible - but that'd be a feature down the road as we'd want to be sure the process was predictable. On Tue, Sep 18, 2018 at 4:04 PM Thomas M. Payerle wrote: > We make use of an large home grown library of Perl scripts this for > creating allocations, creating users, adding users to allocations, etc. > > We have a number of "flavors" of allocations, but most allocation > creation/disabling activity occurs with respect to applications for > allocations which are reviewed by a faculty committee, and although > percentage of applications approved is rather high, it is not automatic > (and many involve requesting the applicant to elaborate or provide > additional information). While we are in the process of migrating the > "application process" to ServiceNow, it will only be as the web form > backend and to track the applications, votes/comments of the faculty > committee, etc. The actual creation of allocations, etc. is all done > manually, albeit by simply invoking a single script or two with a handful > of parameters. The scripts take care of all the Unixy and Slurm tasks > required to create the allocation, etc., as well as sending the standard > welcome email to the allocation > "owner", updating local DBs about the new allocation, etc., and keeping a > log a what was done and why (i.e. linking the action to the specific > application). Scripts exist for > a variety of standard tasks, both high and low level. > > A couple of the underlying libraries (Perl wrappers around sacctmgr and > sshare commands) are available on CPAN (Slurm::Sacctmgr, Slurm::Sshare); > the rest lack the polish and finish required for publishing on CPAN. > > On Tue, Sep 18, 2018 at 3:02 PM David Rhey wrote: > >> Hello, All, >> >> First time caller, long-time listener. Does anyone use any sort of >> external tool (e.g. a form submission) that generates accounts for their >> Slurm environment (notably for new accounts/allocations)? An example of >> this would be: a group or user needs us to provision resources for them to >> run on and so they submit a form to us with information on their needs and >> we provision for them. >> >> If anyone is using external utilities, are they manually putting that in >> or are they leveraging Slurm APIs to do this? It's a long shot, but if >> anyone is doing this with ServiceNow, I'd be extra interested in how you >> achieved that. >> >> Thanks! >> >> -- >> David Rhey >> --- >> Advanced Research Computing - Technology Services >> University of Michigan >> > > > -- > Tom Payerle > DIT-ACIGS/Mid-Atlantic Crossroadspaye...@umd.edu > 5825 University Research Park (301) 405-6135 > University of Maryland > College Park, MD 20740-3831 > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
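At their core, provisioning scripts like those described above usually reduce to a handful of sacctmgr calls plus the Unix-side account work. A hedged sketch with placeholder account, user and limit values; the -i flag suppresses the interactive confirmation so the commands can be driven from a form backend or ticket system.

sacctmgr -i add account proj001 Description="new allocation" Organization=research
sacctmgr -i add user alice Account=proj001
sacctmgr -i modify account where name=proj001 set GrpTRESMins=cpu=1000000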
Re: [slurm-users] External provisioning for accounts and other things (?)
Thanks! I'll check this out. Ya'll are awesome for the responses. On Wed, Sep 19, 2018 at 7:57 AM Chris Samuel wrote: > On Wednesday, 19 September 2018 5:00:58 AM AEST David Rhey wrote: > > > First time caller, long-time listener. Does anyone use any sort of > external > > tool (e.g. a form submission) that generates accounts for their Slurm > > environment (notably for new accounts/allocations)? An example of this > > would be: a group or user needs us to provision resources for them to run > > on and so they submit a form to us with information on their needs and we > > provision for them. > > The Karaage cluster management software that was originally written by > folks > at ${JOB-2} and which we used with Slurm at ${JOB-1} does all this. I'm > not > sure how actively maintained it is (as we have our own system at ${JOB}), > but > it's on Github here: > > https://github.com/Karaage-Cluster/karaage/ > > The Python code that handles the Slurm side of things is here: > > > https://github.com/Karaage-Cluster/karaage/blob/master/karaage/datastores/slurm.py > > Hope that helps! > > All the best, > Chris > -- > Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC > > > > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
[slurm-users] Priority access for a group of users
Hello. We have a small set of compute nodes owned by a group. The group has agreed that the rest of the HPC community can use these nodes providing that they (the owners) can always have priority access to the nodes. The four nodes are well provisioned (1 TByte memory each plus 2 GRID K2 graphics cards) and so there is no need to worry about preemption. In fact I'm happy for the nodes to be used as well as possible by all users. It's just that jobs from the owners must take priority if resources are scarce. What is the best way to achieve the above in slurm? I'm planning to place the nodes in their own partition. The node owners will have priority access to the nodes in that partition, but will have no advantage when submitting jobs to the public resources. Does anyone please have any ideas how to deal with this? Best regards, David
Re: [slurm-users] How to request ONLY one CPU instead of one socket or one node?
Hello, Are you sure you're NOT getting 1 CPU when you run your job? You might want to put some echo logic into your job to look at Slurm env variables of the node your job lands on as a way of checking. E.g.: echo $SLURM_CPUS_ON_NODE echo $SLURM_JOB_CPUS_PER_NODE I don't see anything wrong with your script. As a test I took the basic parameters you've outlined and ran an interactive `srun` session, requesting 1 CPU per task and 4 CPUs per task, and then looked at the aforementioned variable output within each session. For example, requesting 1 CPU per task: [drhey@beta-login ~]$ srun --cpus-per-task=1 --ntasks-per-node=1 --partition=standard --mem=1G --pty bash [drhey@bn19 ~]$ echo $SLURM_CPUS_ON_NODE 1 And again, running this command now asking for 4 CPUs per task and then echoing the env var: [drhey@beta-login ~]$ srun --cpus-per-task=4 --ntasks-per-node=1 --partition=standard --mem=1G --pty bash [drhey@bn19 ~]$ echo $SLURM_CPUS_ON_NODE 4 HTH! David On Wed, Feb 13, 2019 at 9:24 PM Wang, Liaoyuan wrote: > Dear there, > > > > I wrote an analytic program to analyze my data. The analysis costs around > twenty days to analyze all data for one species. When I submit my job to > the cluster, it always request one node instead of one CPU. I am wondering > how I can ONLY request one CPU using “sbatch” command? Below is my batch > file. Any comments and help would be highly appreciated. > > > > Appreciatively, > > Leon > > > > #!/bin/sh > > #SBATCH --ntasks=1 > #SBATCH --cpus-per-task=1 > #SBATCH -t 45-00:00:00 > #SBATCH -J 9625%j > #SBATCH -o 9625.out > #SBATCH -e 9625.err > > /home/scripts/wcnqn.auto.pl > > ======= > > Where wcnqn.auto.pl is my program. 9625 denotes the species number. > > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
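The same check works inside a batch script, which is closer to the original poster's workflow; a minimal sketch of their job with the suggested echo logic added:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -t 45-00:00:00

echo "CPUs allocated on this node: $SLURM_CPUS_ON_NODE"
echo "CPUs per node in the allocation: $SLURM_JOB_CPUS_PER_NODE"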
Re: [slurm-users] Priority access for a group of users
Hi Paul, Marcus, Thank you for your replies. Using partition priority all makes sense. I was thinking of doing something similar with a set of nodes purchased by another group. That is, having a private high priority partition and a lower priority "scavenger" partition for the public. In this case scavenger jobs will get killed when preempted. In the present case , I did wonder if it would be possible to do something with just a single partition -- hence my question.Your replies have convinced me that two partitions will work -- with preemption leading to re-queued jobs. Best regards, David On Fri, Feb 15, 2019 at 3:09 PM Paul Edmon wrote: > Yup, PriorityTier is what we use to do exactly that here. That said > unless you turn on preemption jobs may still pend if there is no space. We > run with REQUEUE on which has worked well. > > > -Paul Edmon- > > > On 2/15/19 7:19 AM, Marcus Wagner wrote: > > Hi David, > > as far as I know, you can use the PriorityTier (partition parameter) to > achieve this. According to the manpages (if I remember right) jobs from > higher priority tier partitions have precedence over jobs from lower > priority tier partitions, without taking the normal fairshare priority into > consideration. > > Best > Marcus > > On 2/15/19 10:07 AM, David Baker wrote: > > Hello. > > > We have a small set of compute nodes owned by a group. The group has > agreed that the rest of the HPC community can use these nodes providing > that they (the owners) can always have priority access to the nodes. The > four nodes are well provisioned (1 TByte memory each plus 2 GRID K2 > graphics cards) and so there is no need to worry about preemption. In fact > I'm happy for the nodes to be used as well as possible by all users. It's > just that jobs from the owners must take priority if resources are scarce. > > > What is the best way to achieve the above in slurm? I'm planning to place > the nodes in their own partition. The node owners will have priority access > to the nodes in that partition, but will have no advantage when submitting > jobs to the public resources. Does anyone please have any ideas how to deal > with this? > > > Best regards, > > David > > > > -- > Marcus Wagner, Dipl.-Inf. > > IT Center > Abteilung: Systeme und Betrieb > RWTH Aachen University > Seffenter Weg 23 > 52074 Aachen > Tel: +49 241 80-24383 > Fax: +49 241 80-624383wag...@itc.rwth-aachen.dewww.itc.rwth-aachen.de > >
[slurm-users] Question on billing tres information from sacct, sshare, and scontrol
Hello, I have a small vagrant setup I use for prototyping/testing various things. Right now, it's running Slurm 18.08.4. I am noticing some differences for the billing TRES in the output of various commands (notably that of sacct, sshare, and scontrol show assoc). On a freshly built cluster, therefore with no prior usage data, I run a basic job to generate some usage data: [vagrant@head vagrant]$ sshare -n -P -A drhey1 -o GrpTRESRaw cpu=3,mem=1199,energy=0,node=3,billing=59,fs/disk=0,vmem=0,pages=0 cpu=3,mem=1199,energy=0,node=3,billing=59,fs/disk=0,vmem=0,pages=0 [vagrant@head vagrant]$ sshare -n -P -A drhey1 -o RawUsage 3611 3611 When I look at the same info within sacct I see: [vagrant@head vagrant]$ sacct -X --format=User,JobID,Account,AllocTRES%50,AllocGRES,ReqGRES,Elapsed,ExitCode UserJobIDAccount AllocTRESAllocGRES ReqGRESElapsed ExitCode - -- -- -- vagrant 2drhey1 billing=30,cpu=2,mem=600M,node=2 00:02:00 0:0 Of note is that the billing TRES shows as being equal to 30 in sacct, but 50 in sshare. Something similar happens in scontrol show assoc: ... GrpTRESMins=cpu=N(3),mem=N(1199),energy=N(0),node=N(3),billing=N(59),fs/disk=N(0),vmem=N(0),pages=N(0) ... Can anyone explain the difference in billing TRES value output between the various commands? I have a couple of theories, and have been looking through source code to try and understand a bit better. For context, I am trying to understand what a job costs, and what usage for an account over a span of say a month costs. Any insight is most appreciated! -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
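One way to reconcile the numbers above, offered as an assumption rather than a definitive answer: sacct reports the billing TRES allocated to the job (billing=30), while sshare reports accumulated usage, i.e. allocation multiplied by run time and then subject to decay. Assuming GrpTRESRaw is tracked in TRES-minutes and RawUsage in TRES-seconds, 30 billing x 2 minutes is roughly the GrpTRESRaw billing=59 entry, and 30 x 120 seconds is roughly the RawUsage of 3611. To compare the two views side by side:

sacct -X -j 2 -o JobID,Elapsed,AllocTRES%50     # per-job allocation
sshare -n -P -A drhey1 -o RawUsage,GrpTRESRaw   # accumulated, decayed usage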
Re: [slurm-users] Priority access for a group of users
Hello, Following up on implementing preemption in Slurm. Thank you again for all the advice. After a short break I've been able to run some basic experiments. Initially, I have kept things very simple and made the following changes in my slurm.conf... # Premption settings PreemptType=preempt/partition_prio PreemptMode=requeue PartitionName=relgroup nodes=red[465-470] ExclusiveUser=YES MaxCPUsPerNode=40 DefaultTime=02:00:00 MaxTime=60:00:00 QOS=relgroup State=UP AllowAccounts=relgroup Priority=10 PreemptMode=off # Scavenger partition PartitionName=scavenger nodes=red[465-470] ExclusiveUser=YES MaxCPUsPerNode=40 DefaultTime=00:15:00 MaxTime=02:00:00 QOS=scavenger State=UP AllowGroups=jfAccessToIridis5 PreemptMode=requeue The nodes in the relgroup queue are owned by the General Relativity group and, of course, they have priority to these nodes. The general population can scavenge these nodes via the scavenger queue. When I use "preemptmode=cancel" I'm happy that the relgroup jobs can preempt the scavenger jobs (and the scavenger jobs are cancelled). When I set the preempt mode to "requeue" I see that the scavenger jobs are still cancelled/killed. Have I missed an important configuration change or is it that lower priority jobs will always be killed and not re-queued? Could someone please advise me on this issue? Also I'm wondering if I really understand the "requeue" option. Does that mean re-queued and run from the beginning or run from the current state (needing check pointing)? Best regards, David On Tue, Feb 19, 2019 at 2:15 PM Prentice Bisbal wrote: > I just set this up a couple of weeks ago myself. Creating two partitions > is definitely the way to go. I created one partition, "general" for normal, > general-access jobs, and another, "interruptible" for general-access jobs > that can be interrupted, and then set PriorityTier accordingly in my > slurm.conf file (Node names omitted for clarity/brevity). > > PartitionName=general Nodes=... MaxTime=48:00:00 State=Up PriorityTier=10 > QOS=general > PartitionName=interruptible Nodes=... MaxTime=48:00:00 State=Up > PriorityTier=1 QOS=interruptible > > I then set PreemptMode=Requeue, because I'd rather have jobs requeued than > suspended. And it's been working great. There are few other settings I had > to change. The best documentation for all the settings you need to change > is https://slurm.schedmd.com/preempt.html > > Everything has been working exactly as desired and advertised. My users > who needed the ability to run low-priority, long-running jobs are very > happy. > > The one caveat is that jobs that will be killed and requeued need to > support checkpoint/restart. So when this becomes a production thing, users > are going to have to acknowledge that they will only use this partition for > jobs that have some sort of checkpoint/restart capability. > > Prentice > > On 2/15/19 11:56 AM, david baker wrote: > > Hi Paul, Marcus, > > Thank you for your replies. Using partition priority all makes sense. I > was thinking of doing something similar with a set of nodes purchased by > another group. That is, having a private high priority partition and a > lower priority "scavenger" partition for the public. In this case scavenger > jobs will get killed when preempted. > > In the present case , I did wonder if it would be possible to do something > with just a single partition -- hence my question.Your replies have > convinced me that two partitions will work -- with preemption leading to > re-queued jobs. 
> > Best regards, > David > > On Fri, Feb 15, 2019 at 3:09 PM Paul Edmon wrote: > >> Yup, PriorityTier is what we use to do exactly that here. That said >> unless you turn on preemption jobs may still pend if there is no space. We >> run with REQUEUE on which has worked well. >> >> >> -Paul Edmon- >> >> >> On 2/15/19 7:19 AM, Marcus Wagner wrote: >> >> Hi David, >> >> as far as I know, you can use the PriorityTier (partition parameter) to >> achieve this. According to the manpages (if I remember right) jobs from >> higher priority tier partitions have precedence over jobs from lower >> priority tier partitions, without taking the normal fairshare priority into >> consideration. >> >> Best >> Marcus >> >> On 2/15/19 10:07 AM, David Baker wrote: >> >> Hello. >> >> >> We have a small set of compute nodes owned by a group. The group has >> agreed that the rest of the HPC community can use these nodes providing >> that they (the owners) can always have priority access to the nodes. The >> four nodes are well provisioned (1 TByte memory each plus 2 GR
Re: [slurm-users] Priority access for a group of users
Hello, Thank you for reminding me about the sbatch "--requeue" option. When I submit test jobs using this option the preemption and subsequent restart of a job works as expected. I've also played around with "preemptmode=suspend" and that also works, however I suspect we won't use that on these "diskless" nodes. As I note I can scavenge resources and preempt jobs myself (I am a member of the "relgroup" and the general public). That is.. 347104 scavengermyjob djb1 PD 0:00 1 (Resources) 347105 relgroupmyjob djb1 R 17:00 1 red465 On the other hand I do not seem to be able to preempt a job submitted by a colleague. That is, my colleague submits a job to the scavenger queue, it starts to run. I then submit a job to the relgroup queue, however that job fails to preempt my colleague's job and stays in pending status. Does anyone understand what might be wrong, please? Best regards, David On Fri, Mar 1, 2019 at 2:47 PM Antony Cleave wrote: > I have always assumed that cancel just kills the job whereas requeue will > cancel and then start from the beginning. I know that requeue does this. I > never tried cancel. > > I'm a fan of the suspend mode myself but that is dependent on users not > asking for all the ram by default. If you can educate the users then this > works really well as the low priority job stays in ram in suspended mode > while the high priority job completes and then the low priority job > continues from where it stopped. No checkpoints and no killing. > > Antony > > > > On Fri, 1 Mar 2019, 12:23 david baker, wrote: > >> Hello, >> >> Following up on implementing preemption in Slurm. Thank you again for all >> the advice. After a short break I've been able to run some basic >> experiments. Initially, I have kept things very simple and made the >> following changes in my slurm.conf... >> >> # Premption settings >> PreemptType=preempt/partition_prio >> PreemptMode=requeue >> >> PartitionName=relgroup nodes=red[465-470] ExclusiveUser=YES >> MaxCPUsPerNode=40 DefaultTime=02:00:00 MaxTime=60:00:00 QOS=relgroup >> State=UP AllowAccounts=relgroup Priority=10 PreemptMode=off >> >> # Scavenger partition >> PartitionName=scavenger nodes=red[465-470] ExclusiveUser=YES >> MaxCPUsPerNode=40 DefaultTime=00:15:00 MaxTime=02:00:00 QOS=scavenger >> State=UP AllowGroups=jfAccessToIridis5 PreemptMode=requeue >> >> The nodes in the relgroup queue are owned by the General Relativity group >> and, of course, they have priority to these nodes. The general population >> can scavenge these nodes via the scavenger queue. When I use >> "preemptmode=cancel" I'm happy that the relgroup jobs can preempt the >> scavenger jobs (and the scavenger jobs are cancelled). When I set the >> preempt mode to "requeue" I see that the scavenger jobs are still >> cancelled/killed. Have I missed an important configuration change or is it >> that lower priority jobs will always be killed and not re-queued? >> >> Could someone please advise me on this issue? Also I'm wondering if I >> really understand the "requeue" option. Does that mean re-queued and run >> from the beginning or run from the current state (needing check pointing)? >> >> Best regards, >> David >> >> On Tue, Feb 19, 2019 at 2:15 PM Prentice Bisbal wrote: >> >>> I just set this up a couple of weeks ago myself. Creating two partitions >>> is definitely the way to go. 
I created one partition, "general" for normal, >>> general-access jobs, and another, "interruptible" for general-access jobs >>> that can be interrupted, and then set PriorityTier accordingly in my >>> slurm.conf file (Node names omitted for clarity/brevity). >>> >>> PartitionName=general Nodes=... MaxTime=48:00:00 State=Up >>> PriorityTier=10 QOS=general >>> PartitionName=interruptible Nodes=... MaxTime=48:00:00 State=Up >>> PriorityTier=1 QOS=interruptible >>> >>> I then set PreemptMode=Requeue, because I'd rather have jobs requeued >>> than suspended. And it's been working great. There are few other settings I >>> had to change. The best documentation for all the settings you need to >>> change is https://slurm.schedmd.com/preempt.html >>> >>> Everything has been working exactly as desired and advertised. My users >>> who needed the ability to run low-priority, long-running jobs are very >>> happy. >>> >>> The one caveat is that jobs that will be k
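For completeness, the pieces that made requeue-on-preemption work in the tests described above: the preempted job itself has to be requeueable, either per job at submit time or via the cluster-wide default. A hedged sketch; the script name is a placeholder, and a requeued job restarts from the beginning unless it checkpoints itself.

# per job, at submit time
sbatch --partition=scavenger --requeue myjob.sh

# or rely on the cluster-wide default for batch jobs in slurm.conf
# (check what the site actually sets; jobs submitted with --no-requeue
# are still simply cancelled when preempted)
JobRequeue=1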
[slurm-users] How do I impose a limit the memory requested by a job?
Hello, I have set up a serial queue to run small jobs in the cluster. Actually, I route jobs to this queue using the job_submit.lua script. Any 1 node job using up to 20 cpus is routed to this queue, unless a user submits their job with an exclusive flag. The partition is shared and so I defined memory to be a resource. I've set default memory/cpu to be 4300 Mbytes. There are 40 cpus installed in the nodes and the usable memory is circa 172000 Mbytes -- hence my default mem/cpu. The compute nodes are defined with RealMemory=19, by the way. I am curious to understand how I can impose a memory limit on the jobs that are submitted to this partition. It doesn't make any sense to request more than the total usable memory on the nodes. So could anyone please advise me how to ensure that users cannot request more than the usable memory on the nodes? Best regards, David PartitionName=serial nodes=red[460-464] Shared=Yes MaxCPUsPerNode=40 DefaultTime=02:00:00 MaxTime=60:00:00 QOS=serial SelectTypeParameters=CR_Core_Memory DefMemPerCPU=4300 State=UP AllowGroups=jfAccessToIridis5 PriorityJobFactor=10 PreemptMode=off
Re: [slurm-users] How do I impose a limit the memory requested by a job?
Hello Paul, Thank you for your advice. That all makes sense. We're running diskless compute nodes and so the usable memory is less than the total memory. So I have added a memory check to my job_submit.lua -- see below. I think that all makes sense. Best regards, David -- Check memory/node is valid if job_desc.min_mem_per_cpu == 9223372036854775808 then job_desc.min_mem_per_cpu = 4300 end memory = job_desc.min_mem_per_cpu * job_desc.min_cpus if memory > 172000 then slurm.log_user("You cannot request more than 172000 Mbytes per node") slurm.log_user("memory is: %u",memory) return slurm.ERROR end On Tue, Mar 12, 2019 at 4:48 PM Paul Edmon wrote: > Slurm should automatically block or reject jobs that can't run on that > partition in terms of memory usage for a single node. So you shouldn't > need to do anything. If you need something less than the max memory per > node then you will need to enforce some limits. We do this via a jobsubmit > lua script. That would be my recommended method. > > > -Paul Edmon- > > > On 3/12/19 12:31 PM, David Baker wrote: > > Hello, > > > I have set up a serial queue to run small jobs in the cluster. Actually, I > route jobs to this queue using the job_submit.lua script. Any 1 node job > using up to 20 cpus is routed to this queue, unless a user submits > their job with an exclusive flag. > > > The partition is shared and so I defined memory to be a resource. I've set > default memory/cpu to be 4300 Mbytes. There are 40 cpus installed in the > nodes and the usable memory is circa 17200 Mbytes -- hence my default > mem/cpu. > > > The compute nodes are defined with RealMemory=19, by the way. > > > I am curious to understand how I can impose a memory limit on the jobs > that are submitted to this partition. It doesn't make any sense to request > more than the total usable memory on the nodes. So could anyone please > advise me how to ensure that users cannot request more than the usable > memory on the nodes. > > > Best regards, > > David > > > PartitionName=serial nodes=red[460-464] Shared=Yes MaxCPUsPerNode=40 > DefaultTime=02:00:00 MaxTime=60:00:00 QOS=serial > SelectTypeParameters=CR_Core_Memory *DefMemPerCPU=4300* State=UP > AllowGroups=jfAccessToIridis5 PriorityJobFactor=10 PreemptMode=off > > > >
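With a check like the one above in place, an oversized request should be rejected at submit time rather than left to pend forever. A hedged example using the 172000 MB figure from this thread, and assuming min_cpus resolves to 40 for a 40-task request (40 x 5000 MB = 200000 MB, over the limit); the exact wording of the error comes from the slurm.log_user calls.

sbatch --partition=serial --ntasks=40 --mem-per-cpu=5000 test.sh
# -> You cannot request more than 172000 Mbytes per node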
[slurm-users] Very large job getting starved out
Hello, I understand that this is not a straightforward question, however I'm wondering if anyone has any useful ideas, please. Our cluster is busy and the QOS has limited users to a maximum of 32 compute nodes on the "batch" queue. Users are making good use of the cluster -- for example one user is running five 6 node jobs at the moment. On the other hand, a job belonging to another user has been stalled in the queue for around 7 days. He has made reasonable use of the cluster and as a result his fairshare component is relatively low. Having said that, the priority of his job is high -- it is currently one of the highest priority jobs in the batch partition queue. From sprio... JOBID PARTITION PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS 359323 batch 180292 10 79646547100 0 I did think that the PriorityDecayHalfLife was quite high at 14 days and so I reduced that to 7 days. For reference I've included the key scheduling settings from the cluster below. Does anyone have any thoughts, please? Best regards, David PriorityDecayHalfLife = 7-00:00:00 PriorityCalcPeriod = 00:05:00 PriorityFavorSmall = No PriorityFlags = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE PriorityMaxAge = 7-00:00:00 PriorityUsageResetPeriod = NONE PriorityType= priority/multifactor PriorityWeightAge = 10 PriorityWeightFairShare = 100 PriorityWeightJobSize = 1000 PriorityWeightPartition = 1000 PriorityWeightQOS = 1
Re: [slurm-users] Very large job getting starved out
Hi Cyrus, Thank you for the links. I've taken a good look through the first link (re the cloud cluster) and the only parameter that might be relevant is "assoc_limit_stop", but I'm not sure if that is relevant in this instance. The reason for the delay of the job in question is "priority", however there are quite a lot of jobs from users in the same accounting group with jobs delayed due to "QOSMaxCpuPerUserLimit". They also talk about using the "builtin" scheduler which I guess would turn off backfill. I have attached a copy of the current slurm.conf so that you and other members can get a better feel for the whole picture. Certainly we see a large number of serial/small (1 node) jobs running through the system and I'm concerned that my setup encourages this behaviour, however how to stem this issue is a mystery to me. If you or anyone else has any relevant thoughts then please let me know. I particular I am keen to understand "assoc_limit_stop" and whether it is a relevant option in this situation. Best regards, David From: slurm-users on behalf of Cyrus Proctor Sent: 21 March 2019 14:19 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Very large job getting starved out Hi David, You might have a look at the thread "Large job starvation on cloud cluster" that started on Feb 27; there's some good tidbits in there. Off the top without more information, I would venture that settings you have in slurm.conf end up backfilling the smaller jobs at the expense of scheduling the larger jobs. Your partition configs plus accounting and scheduler configs from slurm.conf would be helpful. Also, search for "job starvation" here: https://slurm.schedmd.com/sched_config.html<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fslurm.schedmd.com%2Fsched_config.html&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7Cea23798d0ad54a02f14308d6ae0883d5%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&sdata=KfjAqNHQgLcUBBYwZFi8OygU2De%2FdVuTwbdOmUv0Dps%3D&reserved=0> as another potential starting point. Best, Cyrus On 3/21/19 8:55 AM, David Baker wrote: Hello, I understand that this is not a straight forward question, however I'm wondering if anyone has any useful ideas, please. Our cluster is busy and the QOS has limited users to a maximum of 32 compute nodes on the "batch" queue. Users are making good of the cluster -- for example one user is running five 6 node jobs at the moment. On the other hand, a job belonging to another user has been stalled in the queue for around 7 days. He has made reasonable use of the cluster and as a result his fairshare component is relatively low. Having said that, the priority of his job is high -- it currently one of the highest priority jobs in the batch partition queue. From sprio... JOBID PARTITION PRIORITYAGE FAIRSHAREJOBSIZE PARTITION QOS 359323 batch 180292 10 79646547100 0 I did think that the PriorityDecayHalfLife was quite high at 14 days and so I reduced that to 7 days. For reference I've included the key scheduling settings from the cluster below. Does anyone have any thoughts, please? Best regards, David PriorityDecayHalfLife = 7-00:00:00 PriorityCalcPeriod = 00:05:00 PriorityFavorSmall = No PriorityFlags = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE PriorityMaxAge = 7-00:00:00 PriorityUsageResetPeriod = NONE PriorityType= priority/multifactor PriorityWeightAge = 10 PriorityWeightFairShare = 100 PriorityWeightJobSize = 1000 PriorityWeightPartition = 1000 PriorityWeightQOS = 1 slurm.conf Description: slurm.conf
Re: [slurm-users] Very large job getting starved out
Hello, Running the command "squeue -j 359323 --start" gives me the following output... JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON) 359323 batchbatch jwk PD N/A 27 (null) (Resources) Best regards, David From: slurm-users on behalf of Christopher Samuel Sent: 21 March 2019 17:54 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Very large job getting starved out On 3/21/19 6:55 AM, David Baker wrote: > it currently one of the highest priority jobs in the batch partition queue What does squeue -j 359323 --start say? -- Chris Samuel : https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.csamuel.org%2F&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7Cca6bf6021f694fd32e6908d6ae265930%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&sdata=WYLLite%2BAMhQPkkpUzyJavT3nDEoN6mq1uofki8aa1A%3D&reserved=0 : Berkeley, CA, USA
[slurm-users] Backfill advice
Hello, We do have large jobs getting starved out on our cluster, and I note particularly that we never manage to see a job getting assigned a start time. It seems very possible that backfilled jobs are stealing nodes reserved for large/higher priority jobs. I'm wondering if our backfill configuration has any bearing on this issue or whether we are unfortunate enough to have hit a bug. One parameter that is missing in our bf setup is "bf_continue". Is that parameter significant in terms of ensuring that bf drills down sufficiently in the job mix? Also we are using the default bf frequency -- should we really reduce the frequency and potentially reduce the number of bf jobs per group/user or total at each iteration? Currently, I think we are setting the per/user limit to 20. Any thoughts would be appreciated, please. Best regards, David
Re: [slurm-users] Backfill advice
Hello Doug, Thank you for your detailed reply regarding how to setup backfill. There's quite a lot to take in there. Fortunately, I now have a day or two to read up and understand the ideas now that our cluster is down due to a water cooling failure. In the first instance, I'll certainly implement bf_continue and review/amend the "bf_maxjobs" and "bf_interval" parameters. Switching on backfill debugging sounds very useful, but does that setting tend to blot the logs if left enabled for long periods? We did have a contract with SchedMD which recently finished. In one of the last discussions we had it was intimated that we may have hit a bug. That's in the respect that backfilled jobs were potentially stealing nodes intended for higher priority jobs -- bug 5297. The advice was to consider upgrading to slurm 18.08.4 and implement bf_ignore_newly_avail_nodes. I was interested to see that you had a similar discussion with SchedMD and did upgrade. I think I ought to update the bf configuration re my first paragraph and see how that goes before we bite the bullet and do the upgrade (we are at 18.08.0 currently). Best regards, David From: slurm-users on behalf of Douglas Jacobsen Sent: 23 March 2019 13:30 To: Slurm User Community List Subject: Re: [slurm-users] Backfill advice Hello, At first blush bf_continue and bf_interval as well as bf_maxjobs (if I remembered the parameter correctly) are critical first steps in tuning. Setting DebugFlags=backfill is essential to getting the needed data to make tuning decisions. Use of per user/account settings if they are too low can also cause starvation depending on the way your priority calculation is set up. I presented these slides a few years ago ag the slurm user group on this topic: https://slurm.schedmd.com/SLUG16/NERSC.pdf<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fslurm.schedmd.com%2FSLUG16%2FNERSC.pdf&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7C479a97721a87443f81c708d6af9455dd%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&sdata=OC57jAeprk%2Bm1tCCn20mtAVzfvmbcj4AgJ5b3wVPJKI%3D&reserved=0> The key thing to keep in mind with large jobs is that slurm needs to evaluate them again and again in the same order or the scheduled time may drift. Thus it is important that once jobs are getting planning reservations they must continue to do so. Because of the prevalence of large jobs at our site we use bf_min_prio_resv which splits the priority space into a reserving and non-reserving set, and then use job age to allow jobs to age from the non reserving portion of the priority space to the reservation portion. Use of the recent MaxJobsAccruePerUser limits on a job qos can throttle the rate of jobs aging and prevent negative effects from users submitting large numbers of jobs. I realize that is a large number of tunables and concepts densely packed, but it should give you some reasonable starting points. Doug On Sat, Mar 23, 2019 at 05:26 david baker mailto:djbake...@gmail.com>> wrote: Hello, We do have large jobs getting starved out on our cluster, and I note particularly that we never manage to see a job getting assigned a start time. It seems very possible that backfilled jobs are stealing nodes reserved for large/higher priority jobs. I'm wondering if our backfill configuration has any bearing on this issue or whether we are unfortunate enough to have hit a bug. One parameter that is missing in our bf setup is "bf_continue". Is that parameter significant in terms of ensuring that bf drills down sufficiently in the job mix? 
Also we are using the default bf frequency -- should we really reduce the frequency and potentially reduce the number of bf jobs per group/user or total at each iteration? Currently, I think we are setting the per/user limit to 20. Any thoughts would be appreciated, please. Best regards, David -- Sent from Gmail Mobile
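A hedged example of the scheduler line being discussed, with bf_continue added and the depth/interval knobs made explicit. The numbers are illustrative starting points rather than recommendations, they would be merged into the site's existing SchedulerParameters string, and bf_ignore_newly_avail_nodes needs the 18.08.4 upgrade mentioned above. DebugFlags=Backfill produces the per-cycle log lines used for tuning; it is fairly chatty, but it logs per backfill cycle rather than per job.

SchedulerParameters=bf_continue,bf_interval=60,bf_window=2880,bf_resolution=300,bf_max_job_test=1000,bf_max_job_user=20,bf_ignore_newly_avail_nodes
DebugFlags=Backfill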
[slurm-users] Slurm users meeting 2019?
Hello, I was searching the web to see if there was going to be a Slurm users’ meeting this year, but couldn’t find anything. Does anyone know if there is a users’ meeting planned for 2019? If so, is it most likely going to be held as part of Supercomputing in Denver? Please let me know if you know what’s planned this year. Best regards, David Sent from my iPad
Re: [slurm-users] Slurm users meeting 2019?
Thank you for the date and location of the this year's Slurm User Group Meeting. Best regards, David From: slurm-users on behalf of Jacob Jenson Sent: 25 March 2019 21:26:45 To: Slurm User Community List Subject: Re: [slurm-users] Slurm users meeting 2019? The 2019 Slurm User Group Meeting will be held in Salt Lake City at the University of Utah on September 17-18. Registration for this user group meeting typically opens in May. Jacob On Mon, Mar 25, 2019 at 2:57 PM david baker mailto:djbake...@gmail.com>> wrote: Hello, I was searching the web to see if there was going to be a Slurm users’ meeting this year, but couldn’t find anything. Does anyone know if there is a users’ meeting planned for 2019? If so, is it most likely going to be held as part of Supercomputing in Denver? Please let me know if you know what’s planned this year. Best regards, David Sent from my iPad
[slurm-users] Effect of PriorityMaxAge on job throughput
Hello, I've finally got the job throughput/turnaround to be reasonable in our cluster. Most of the time the job activity on the cluster sets the default QOS to 32 nodes (there are 464 nodes in the default queue). Jobs requesting nodes close to the QOS level (for example 22 nodes) are scheduled within 24 hours which is better than it has been. Still I suspect there is room for improvement. I note that these large jobs still struggle to be given a starttime, however many jobs are now being given a starttime following my SchedulerParameters makeover. I used advice from the mailing list and the Slurm high throughput document to help me make changes to the scheduling parameters. They are now... SchedulerParameters=assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,bf_yield_interval=100,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=200 Also.. PriorityFavorSmall=NO PriorityFlags=SMALL_RELATIVE_TO_TIME,ACCRUE_ALWAYS,FAIR_TREE PriorityType=priority/multifactor PriorityDecayHalfLife=7-0 PriorityMaxAge=1-0 The most significant change was actually reducing "PriorityMaxAge" from 7-0 to 1-0. Before that change the larger jobs could hang around in the queue for days. Does it make sense therefore to further reduce PriorityMaxAge to less than 1 day? Your advice would be appreciated, please. Best regards, David
Re: [slurm-users] Effect of PriorityMaxAge on job throughput
Michael, Thank you for your reply and your thoughts. These are the priority weights that I have configured in the slurm.conf. PriorityWeightFairshare=100 PriorityWeightAge=10 PriorityWeightPartition=1000 PriorityWeightJobSize=1000 PriorityWeightQOS=1 I've made the PWJobSize to be the highest factor, however I understand that that only provides a once-off kick to jobs and so it probably insignificant in the longer run . That's followed by the PWFairshare. Should I really be looking at increasing the PWAge factor to help to "push jobs" through the system? The other issue that might play a part is that we see a lot of single node jobs (presumably backfilled) into the system. Users aren't excessively bombing the cluster, but maybe some backfill throttling would be useful as well (?) What are your thoughts having seen the priority factors, please? I've attached a copy of the slurm.conf just in case you or anyone else wants to take a more complete overview. Best regards, David From: slurm-users on behalf of Michael Gutteridge Sent: 09 April 2019 18:59 To: Slurm User Community List Subject: Re: [slurm-users] Effect of PriorityMaxAge on job throughput It might be useful to include the various priority factors you've got configured. The fact that adjusting PriorityMaxAge had a dramatic effect suggests that the age factor is pretty high- might be worth looking at that value relative to the other factors. Have you looked at PriorityWeightJobSize? Might have some utility if you're finding large jobs getting short-shrift. - Michael On Tue, Apr 9, 2019 at 2:01 AM David Baker mailto:d.j.ba...@soton.ac.uk>> wrote: Hello, I've finally got the job throughput/turnaround to be reasonable in our cluster. Most of the time the job activity on the cluster sets the default QOS to 32 nodes (there are 464 nodes in the default queue). Jobs requesting nodes close to the QOS level (for example 22 nodes) are scheduled within 24 hours which is better than it has been. Still I suspect there is room for improvement. I note that these large jobs still struggle to be given a starttime, however many jobs are now being given a starttime following my SchedulerParameters makeover. I used advice from the mailing list and the Slurm high throughput document to help me make changes to the scheduling parameters. They are now... SchedulerParameters=assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,bf_yield_interval=100,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=200 Also.. PriorityFavorSmall=NO PriorityFlags=SMALL_RELATIVE_TO_TIME,ACCRUE_ALWAYS,FAIR_TREE PriorityType=priority/multifactor PriorityDecayHalfLife=7-0 PriorityMaxAge=1-0 The most significant change was actually reducing "PriorityMaxAge" to 7-0 to 1-0. Before that change the larger jobs could hang around in the queue for days. Does it make sense therefore to further reduce PriorityMaxAge to less than 1 day? Your advice would be appreciated, please. Best regards, David slurm.conf Description: slurm.conf
Re: [slurm-users] Effect of PriorityMaxAge on job throughput
Hello Michael, Thank you for your email and apologies for my tardy response. I'm still sorting out my mailbox after an Easter break. I've taken your comments on board and I'll see how I go with your suggestions. Best regards, David From: slurm-users on behalf of Michael Gutteridge Sent: 16 April 2019 16:43 To: Slurm User Community List Subject: Re: [slurm-users] Effect of PriorityMaxAge on job throughput (sorry, kind of fell asleep on you there...) I wouldn't expect backfill to be a problem since it shouldn't be starting jobs that won't complete before the priority reservations start. We allow jobs to go over (overtimelimit) so in our case it can be a problem. On one of our cloud clusters we had problems with large jobs getting starved so we set "assoc_limit_stop" in the scheduler parameters- I think for your config it would require removing "assoc_limit_continue" (we're on Slurm 18 and _continue is the default, replaced by _stop if you want that behavior). However, there we use the builtin scheduler- I'd imagine this would play heck with a fairshare/backfill cluster (like our on-campus) though. However, it is designed to prevent large-job starvation. We'd also had some issues with fairshare hitting the limit pretty quickly- basically it stopped being a useful factor in calculating priority- so we set FairShareDampeningFactor to 5 to get a little more utility out of that. I'd suggest looking at the output of sprio to see how your factors are working in situ, particularly when you've got a stuck large job. It may be that the SMALL_RELATIVE_TO_TIME could be washing out the job size factor if your larger jobs are also longer. HTH. M On Wed, Apr 10, 2019 at 2:46 AM David Baker mailto:d.j.ba...@soton.ac.uk>> wrote: Michael, Thank you for your reply and your thoughts. These are the priority weights that I have configured in the slurm.conf. PriorityWeightFairshare=100 PriorityWeightAge=10 PriorityWeightPartition=1000 PriorityWeightJobSize=1000 PriorityWeightQOS=1 I've made the PWJobSize to be the highest factor, however I understand that that only provides a once-off kick to jobs and so it probably insignificant in the longer run . That's followed by the PWFairshare. Should I really be looking at increasing the PWAge factor to help to "push jobs" through the system? The other issue that might play a part is that we see a lot of single node jobs (presumably backfilled) into the system. Users aren't excessively bombing the cluster, but maybe some backfill throttling would be useful as well (?) What are your thoughts having seen the priority factors, please? I've attached a copy of the slurm.conf just in case you or anyone else wants to take a more complete overview. Best regards, David From: slurm-users mailto:slurm-users-boun...@lists.schedmd.com>> on behalf of Michael Gutteridge mailto:michael.gutteri...@gmail.com>> Sent: 09 April 2019 18:59 To: Slurm User Community List Subject: Re: [slurm-users] Effect of PriorityMaxAge on job throughput It might be useful to include the various priority factors you've got configured. The fact that adjusting PriorityMaxAge had a dramatic effect suggests that the age factor is pretty high- might be worth looking at that value relative to the other factors. Have you looked at PriorityWeightJobSize? Might have some utility if you're finding large jobs getting short-shrift. - Michael On Tue, Apr 9, 2019 at 2:01 AM David Baker mailto:d.j.ba...@soton.ac.uk>> wrote: Hello, I've finally got the job throughput/turnaround to be reasonable in our cluster. 
Most of the time the job activity on the cluster sets the default QOS to 32 nodes (there are 464 nodes in the default queue). Jobs requesting nodes close to the QOS level (for example 22 nodes) are scheduled within 24 hours which is better than it has been. Still I suspect there is room for improvement. I note that these large jobs still struggle to be given a starttime, however many jobs are now being given a starttime following my SchedulerParameters makeover. I used advice from the mailing list and the Slurm high throughput document to help me make changes to the scheduling parameters. They are now... SchedulerParameters=assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,bf_yield_interval=100,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=200 Also.. PriorityFavorSmall=NO PriorityFlags=SMALL_RELATIVE_TO_TIME,ACCRUE_ALWAYS,FAIR_TREE PriorityType=priority/multifactor PriorityDecayHalfLife=7-0 PriorityMaxAge=1-0 The most significant change was actually reducing "PriorityMaxAge" to 7-0 to 1-0. Before that change the lar
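The concrete check suggested above can be done with sprio: compare the configured weights against the per-factor breakdown of a pending large job to see which term is dominating. The job ID below is a placeholder.

sprio -w             # show the configured priority weights
sprio -l -j <jobid>  # per-factor breakdown (age, fairshare, jobsize, ...) for one pending job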
[slurm-users] Slurm database failure messages
Hello, We are experiencing quite a number of database failures. We saw an outright failure a short while ago where we had to restart the maria database and the slurmdbd process. After restarting the database appear to be working well, however over the last few days I have notice quite a number of failures. For example -- see below. Does anyone understand what might be going wrong, why and whether we should be concerned, please? I understand that slurm databases can get quite large relatively quickly and so I wonder if this is memory related. Best regards, David [root@blue51 slurm]# less slurmdbd.log-20190506.gz | grep failed [2019-05-05T04:00:05.603] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction [2019-05-05T04:00:05.606] error: Cluster i5 rollup failed [2019-05-05T23:00:07.017] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction [2019-05-05T23:00:07.018] error: Cluster i5 rollup failed [2019-05-06T00:00:13.348] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction [2019-05-06T00:00:13.350] error: Cluster i5 rollup failed
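These rollup deadlocks usually point at the MariaDB/InnoDB configuration rather than at data loss, and SchedMD's accounting documentation suggests tuning the database server before anything else. A sketch of the relevant my.cnf section, with sizes that are only placeholders and would need adjusting for the database host:

[mysqld]
innodb_buffer_pool_size=4096M    # a sizeable fraction of RAM on the DB host
innodb_log_file_size=64M
innodb_lock_wait_timeout=900     # give the hourly rollup transactions more time before they are aborted

After changing these, restart mariadb and then slurmdbd; the deadlock messages during rollup may well become much rarer.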
[slurm-users] Partition QOS limits not being applied
Hi SLURM users, I work on a cluster, and we recently transitioned to using SLURM on some of our nodes. However, we're currently having some difficulty limiting the number of jobs that a user can run simultaneously in particular partitions. Here are the steps we've taken: 1. Created a new QOS and set MaxJobsPerUser=4 with sacctmgr. 2. Modified slurm.conf so that the relevant PartitionName line includes QOS=. 3. Restarted slurmctld. However, after taking these steps, the partition in question still does not have any limits on the number of jobs that a user can run simultaneously. Is there something wrong here, or are there additional steps that we need to take? Any advice is greatly appreciated! Thanks, Dave -- Dave Carlson PhD Candidate Ecology and Evolution Department Stony Brook University
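For comparison, the full chain for a partition QOS normally looks something like the sketch below (the QOS and partition names here are made up). The step that is most often missed is the enforcement setting in slurm.conf, without which the limits are stored in the database but never applied:

# create the QOS and set the per-user running-job cap
sacctmgr add qos part_limit
sacctmgr modify qos where name=part_limit set MaxJobsPerUser=4

# slurm.conf: attach the QOS to the partition and enforce limits
PartitionName=restricted Nodes=node[01-08] QOS=part_limit State=UP
AccountingStorageEnforce=limits,qos

# pick up the slurm.conf edit (a change to AccountingStorageEnforce may need a slurmctld restart)
scontrol reconfigure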
[slurm-users] Testing/evaluating new versions of slurm (19.05 in this case)
Hello, Following the various postings regarding slurm 19.05 I thought it was an opportune time to send this question to the forum. Like others I'm awaiting 19.05 primarily due to the addition of the XFACTOR priority setting, but due to other new/improved features as well. I'm interested to hear how other admins/groups test (and stress) new versions of slurm. That is, how do admins test a new version with (a) a realistic workload and (b) sufficient hardware resources, without taking too many hardware resources from their production cluster and/or annoying too many users? I understand that it is possible to emulate a large cluster on SMP nodes by firing up many slurm processes on those nodes, for example. I have been experimenting with a slurm simulator (https://github.com/ubccr-slurm-simulator/slurm_sim_tools/blob/master/doc/slurm_sim_manual.Rmd) using historical job data, however that simulator is based on an old version of slurm and (to be honest) it's slightly unreliable for serious study. It's certainly only useful for broad-brush analysis, at the most. Please let me have your thoughts -- they would be appreciated. Best regards, David
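On the emulation point: Slurm can be built with --enable-multiple-slurmd, which allows one large host to run many slurmd daemons, each presenting itself as a separate node. A rough sketch, with invented hostnames, ports and node sizes (the exact Port-range syntax varies a little between versions):

# build slurm with multiple-slurmd support
./configure --enable-multiple-slurmd && make && make install

# slurm.conf on the test cluster: fifty "nodes" all backed by one real machine
NodeName=emu[001-050] NodeHostname=bigsmp01 Port=17001-17050 CPUs=4 RealMemory=8000 State=UNKNOWN
PartitionName=emu Nodes=emu[001-050] Default=YES MaxTime=INFINITE State=UP

# start one slurmd per emulated node
slurmd -N emu001
slurmd -N emu002
# ...and so on

This gives the scheduler a realistic workload (many nodes, many jobs) without touching production hardware, although it obviously says nothing about MPI or I/O behaviour.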
[slurm-users] Updating slurm priority flags
Hello, I have a quick question regarding updating the priority flags in the slurm.conf file. Currently I have the flag "small_relative_to_time" set. I'm finding that that flag is washing out the job size priority weight factor and I would like to experiment without it. So when you remove that priority flag from the configuration should slurm automatically update the job size priority weight factor for the existing jobs? I am concerned that existing jobs will not have their priority changed. Does anyone know how to make this sort of change without adversely affecting the "dynamics" of existing and new jobs in the cluster? That is, I don't want existing jobs to lose out cf new jobs re overall priority. Your advice would be appreciated, please. Best regards, David
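If it helps, the mechanical side of the change is small; the open question is only how existing pending jobs are re-scored. A sketch, assuming the rest of the PriorityFlags line stays as it is:

# slurm.conf: drop SMALL_RELATIVE_TO_TIME, keep the remaining flags
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE

# push the change to the controller
scontrol reconfigure

# then watch the job-size component for a few pending jobs
sprio -l | head

With the multifactor plugin, pending-job priorities are recalculated periodically (PriorityCalcPeriod), so existing jobs should pick up the new job-size weighting rather than keeping their old values -- but that is worth confirming on a test system before relying on it.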
[slurm-users] Advice on setting up fairshare
Hello, Could someone please give me some advice on setting up the fairshare in a cluster. I don't think the present setup is wildly incorrect, however either my understanding of the setup is wrong or something is misconfigured. When we set a new user up on the cluster and they haven't used any resources, am I correct in thinking that their fairshare (as reported by sshare -a) should be 1.0? Looking at a new user, I see...
[root@blue52 slurm]# sshare -a | grep rk1n15
 soton                rk1n15          1    0.003135            0          0.00   0.822165
This is a very simple setup. We have a number of groups (all under root)...
soton -- general public
hydrology -- specific groups that have purchased their own nodes.
relgroup
worldpop
What I do for each of these groups, when a new user is added, is increment the number of shares for the relevant group using, for example...
sacctmgr modify account soton set fairshare=X
Where X is the number of users in the group (soton in this case). The sshare -a command would give me a global overview...
Account              User    RawShares  NormShares     RawUsage  EffectvUsage  FairShare
-------------------- ------  ---------  ----------  -----------  ------------  ---------
root                                          0.00  15431286261          1.00
root                 root            1    0.002755           40          0.00       1.00
hydrology                            3    0.008264      1357382          0.88
hydrology            da1g18          1    0.33                0          0.00   0.876289
Does that all make sense or am I missing something? I am, by the way, using the line PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE in my slurm.conf. Best regards, David
Re: [slurm-users] Advice on setting up fairshare
Hi Loris, Thank you for your reply. I had started to read about 'Fairshare=parent' and wondered if that was the way to go. So that all makes sense. I set 'fairshare=parent' at the account levels and that does the job very well. Things are looking much better and now new (and eternally idle) users receive a fairshare of 1 as expected. It certainly makes the scripts/admin a great deal less cumbersome. Best regards, David From: slurm-users on behalf of Loris Bennett Sent: 07 June 2019 07:11:36 To: Slurm User Community List Subject: Re: [slurm-users] Advice on setting up fairshare Hi David, I have had time to look into your current problem, but inline I have some comments about the general approach. David Baker writes: > Hello, > > Could someone please give me some advice on setting up the fairshare > in a cluster. I don't think the present setup is wildly incorrect, > however either my understanding of the setup is wrong or something is > reconfigured. > > When we set a new user up on the cluster and they haven't used any > resources am I correct in thinking that their fairshare (as reported > by sshare -a) should be 1.0? Looking at a new user, I see... > > [root@blue52 slurm]# sshare -a | grep rk1n15 > soton rk1n15 1 0.003135 0 0.00 0.822165 > > This is a very simple setup. We have a number of groups (all under > root)... > > soton -- general public > > hydrology - specific groups that have purchased their own nodes. > > relgroup > > worldpop > > What I do for each of these groups, when a new user is added, is > increment the number of shares per the relevant group using, for > example... > > sacctmgr modify account soton set fairshare=X > > Where X is the number of users in the group (soton in this case). I did this for years, wrote added logic to automatically increment/decrement shares when user were added/deleted/moved, but recently realised that for our use-case it is not necessary. The way shares are seem to be intended to work is that some project gets a fixed allocation on the system, or some group buys a certain number of node for the cluster. Shares are then dished out based on the size of the project or number of nodes and are thus fairly static. You seem to have more of a setup like we do: a centrally financed system which is free to use and where everyone is treated equally. What we now do is have the Fairshare parameter for all accounts in the hierarchy set to "Parent". This means that everyone ends up with one normalised share and no changes have to be propagated through the hierarchy. We also added creating the Slurm association to the submit plugin, so that if someone applies for access, but never logs in, we can remove them from the system after four weeks without having to clear up in Slurm as well. Maybe this kind of approach might work for you, too. Cheers, Loris > The sshare -a command would give me a global overview... > > Account User RawShares NormShares RawUsage EffectvUsage FairShare > -- -- --- --- > - -- > root 0.00 15431286261 1.00 > root root 1 0.002755 40 0.00 1.00 > hydrology 3 0.008264 1357382 0.88 > hydrology da1g18 1 0.33 0 0.00 0.876289 > > > Does that all make sense or am I missing something? I am, by the way, > using the line > > PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE in my slurm.conf. > > Best regards, > > David > > -- Dr. Loris Bennett (Mr.) ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
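For anyone following along, the change described above boils down to something like this, using the account names from the thread:

sacctmgr modify account where name=soton set fairshare=parent
sacctmgr modify account where name=hydrology set fairshare=parent
sacctmgr modify account where name=relgroup set fairshare=parent
sacctmgr modify account where name=worldpop set fairshare=parent

# user associations keep the default fairshare=1; a new, unused user should then show
sshare -a | grep <username>    # FairShare close to 1.0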
[slurm-users] Deadlocks in slurmdbd logs
Hello, Everyday we see several deadlocks in our slurmdbd log file. Together with the deadlock we always see a failed "roll up" operation. Please see below for an example. We are running slurm 18.08.0 on our cluster. As far as we know these deadlocks are not adversely affecting the operation of the cluster. Each day jobs are "rolling" through the cluster and the utilisation of the cluster is constantly high. Furthermore, it doesn't appear that we are losing data in the database. I'm not a database expert and so I have no idea where to start with this. Our local db experts have taken a look and are nonplussed. I wondered if anyone in the community had any ideas please. As an aside I've just started to experiment with v19* and it would be nice to think that these deadlocks will just go away in due course (following an eventual upgrade when that version is a bit more mature), however that may not be the case. Best regards, David [2019-06-19T00:00:02.728] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction insert into "i5_assoc_usage_hour_table" . [2019-06-19T00:00:02.729] error: Couldn't add assoc hour rollup [2019-06-19T00:00:02.729] error: Cluster i5 rollup failed
[slurm-users] Requirement to run longer jobs
Hello, A few of our users have asked about running longer jobs on our cluster. Currently our main/default compute partition has a time limit of 2.5 days. Potentially, a handful of users need jobs to run for up to 5 days. Rather than allow all users/jobs to have a run time limit of 5 days, I wondered if the following scheme makes sense...
Increase the max run time on the default partition to 5 days, however limit most users to a max of 2.5 days using the default "normal" QOS.
Create a QOS called "long" with a max time limit of 5 days.
Limit the users who can use "long".
For authorized users, assign the "long" QOS to their jobs on the basis of their run time request.
Does the above make sense or is it too complicated? If the above works, could users limited to the normal QOS have their running jobs' run time increased to 5 days in exceptional circumstances? I would be interested in your thoughts, please. Best regards, David
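A sketch of how that scheme might be laid out (wall-times and the partition, user and job names here are placeholders):

# cap the default QOS at 2.5 days and create a 5-day QOS
sacctmgr modify qos where name=normal set MaxWall=2-12:00:00
sacctmgr add qos long
sacctmgr modify qos where name=long set MaxWall=5-00:00:00

# slurm.conf: the partition itself allows the longer limit
PartitionName=batch Nodes=... MaxTime=5-00:00:00 ...

# grant the long QOS only to authorised users...
sacctmgr modify user where name=someuser set qos+=long

# ...who then submit with
sbatch --qos=long --time=5-00:00:00 job.sh

# for the exceptional case, an administrator can extend a running job directly
scontrol update jobid=123456 TimeLimit=5-00:00:00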
Re: [slurm-users] Requirement to run longer jobs
Hello, Thank you to everyone who replied to my email. I'll need to experiment and see how I get on. Best regards, David From: slurm-users on behalf of Loris Bennett Sent: 04 July 2019 06:53 To: Slurm User Community List Subject: Re: [slurm-users] Requirement to run longer jobs Hi Chris, Chris Samuel writes: > On 3/7/19 8:49 am, David Baker wrote: > >> Does the above make sense or is it too complicated? > > [looks at our 14 partitions and 112 QOS's] > > Nope, that seems pretty simple. We do much the same here. Out of interest, how many partitions and QOSs would an average user actually ever use? I'm coming from a very simple set-up which originally had just 3 partitions and 3 QOSs. We have now gone up to 6 partitions and I'm already worrying that it's getting too complicated 😅 Cheers, Loris -- Dr. Loris Bennett (Mr.) ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
Re: [slurm-users] Invalid qos specification
I ran into this recently. You need to make sure your user account has access to that QoS through sacctmgr. Right now I'd say if you did sacctmgr show user withassoc that the QoS you're attempting to use is NOT listed as part of the association. On Mon, Jul 15, 2019 at 2:53 PM Prentice Bisbal wrote: > Slurm users, > > I have created a partition named general should allow the QOSes > 'general' and 'debug': > > PartitionName=general Default=YES AllowQOS=general,debug Nodes=. > > However, when I try to request that QOS, I get an error: > > $ salloc -p general -q debug -t 00:30:00 > salloc: error: Job submit/allocate failed: Invalid qos specification > > I'm sure I'm overlooking something obvious. Any idea what that may be? > I'm using slurm 18.08.8 on the slurm controller, and the clients are > still at 18.08.7 until tomorrow morning. > > -- > Prentice > > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
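To make that check concrete, something along these lines should show whether the QOS is in the association, and add it if not (the user name is just a placeholder):

# list the QOS values attached to the user's associations
sacctmgr show assoc where user=prentice format=cluster,account,user,qos

# if 'debug' is missing, add it
sacctmgr modify user where name=prentice set qos+=debug

Note that AllowQOS on the partition only says which QOS values the partition will accept; the user still has to hold the QOS in their association before -q debug will be accepted.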
Re: [slurm-users] Cluster-wide GPU Per User limit
Unfortunately, I think you're stuck in setting it at the account level with sacctmgr. You could also set that limit as part of a QoS and then attach the QoS to the partition. But I think that's as granular as you can get for limiting TRES'. HTH! David On Wed, Jul 17, 2019 at 10:11 AM Mike Harvey wrote: > > Is it possible to set a cluster level limit of GPUs per user? We'd like > to implement a limit of how many GPUs a user may use across multiple > partitions at one time. > > I tried this, but it obviously isn't correct: > > # sacctmgr modify cluster slurm_cluster set MaxTRESPerUser=gres/gpu=2 > Unknown option: MaxTRESPerUser=gres/gpu=2 > Use keyword 'where' to modify condition > > > Thanks! > > -- > Mike Harvey > Systems Administrator > Engineering Computing > Bucknell University > har...@bucknell.edu > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
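A sketch of the QoS-on-partition approach mentioned above, with invented QOS and partition names:

# a QOS that caps GPUs per user
sacctmgr add qos gpucap
sacctmgr modify qos where name=gpucap set MaxTRESPerUser=gres/gpu=2

# slurm.conf: attach it to the GPU partition(s)
PartitionName=gpu1 Nodes=gpunode[01-04] QOS=gpucap State=UP
PartitionName=gpu2 Nodes=gpunode[05-08] QOS=gpucap State=UP

Whether this behaves as a single cluster-wide cap or as a per-partition cap depends on how usage is tracked against the shared QOS, so it is worth testing with a couple of jobs before relying on it.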
Re: [slurm-users] No error/output/run
>From your email it looks like you submitted the job, ran squeue and saw that it either didn't start or completed very quickly. I'd start with the job ExitCode info from sacct. On Wed, Jul 24, 2019 at 4:34 AM Mahmood Naderan wrote: > Hi, > I don't know why no error/output file is generated after the job > submission. > > $ ls -l > total 8 > -rw-r--r-- 1 montazeri montazeri 472 Jul 24 12:52 in.lj > -rw-rw-r-- 1 montazeri montazeri 254 Jul 24 12:53 slurm_script.sh > $ cat slurm_script.sh > #!/bin/bash > #SBATCH --job-name=my_lammps > #SBATCH --output=out.lj > #SBATCH --partition=EMERALD > #SBATCH --account=z55 > #SBATCH --mem=4GB > #SBATCH --nodes=4 > #SBATCH --ntasks-per-node=3 > mpirun -np 12 /share/apps/softwares/lammps-12Dec18/src/lmp_mpi -in in.lj > > $ sbatch slurm_script.sh > Submitted batch job 1277 > $ squeue > JOBID PARTITION NAME USER ST TIME NODES > NODELIST(REASON) > $ ls > in.lj slurm_script.sh > $ > > > What does that mean? > > Regards, > Mahmood > > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
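For reference, the sort of sacct query meant here, using the job ID from the post:

sacct -j 1277 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,NodeList

# while the job is still known to the controller, the expected output path can also be checked
scontrol show job 1277 | grep -i StdOut

A non-zero ExitCode, or a State of FAILED or NODE_FAIL, usually explains why out.lj never appeared.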
[slurm-users] Slurm node weights
Hello, I'm experimenting with node weights and I'm very puzzled by what I see. Looking at the documentation I gathered that jobs will be allocated to the nodes with the lowest weight which satisfies their requirements. I have 3 nodes in a partition and I have defined the nodes like so.. NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN Weight=50 NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN So, given that the default weight is 1 I would expect jobs to be allocated to orange02 and orange03 first. I find, however that my test job is always allocated to orange01 with the higher weight. Have I overlooked something? I would appreciate your advice, please.
Re: [slurm-users] Slurm node weights
Hello, As an update I note that I have tried restarting the slurmctld, however that doesn't help. Best regards, David From: slurm-users on behalf of David Baker Sent: 25 July 2019 11:47:35 To: slurm-users@lists.schedmd.com Subject: [slurm-users] Slurm node weights Hello, I'm experimenting with node weights and I'm very puzzled by what I see. Looking at the documentation I gathered that jobs will be allocated to the nodes with the lowest weight which satisfies their requirements. I have 3 nodes in a partition and I have defined the nodes like so.. NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN Weight=50 NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN So, given that the default weight is 1 I would expect jobs to be allocated to orange02 and orange03 first. I find, however that my test job is always allocated to orange01 with the higher weight. Have I overlooked something? I would appreciate your advice, please.
Re: [slurm-users] Slurm node weights
Hello, Thank you for the replies. We're running an early version of Slurm 18.08 and it does appear that the node weights are being ignored re the bug. We're experimenting with Slurm 19*, however we don't expect to deploy that new version for quite a while. In the meantime does anyone know if there any fix or alternative strategy that might help us to achieve the same result? Best regards, David From: slurm-users on behalf of Sarlo, Jeffrey S Sent: 25 July 2019 12:26 To: Slurm User Community List Subject: Re: [slurm-users] Slurm node weights Which version of Slurm are you running? I know some of the earlier versions of 18.08 had a bug and node weights were not working. Jeff From: slurm-users on behalf of David Baker Sent: Thursday, July 25, 2019 6:09 AM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Slurm node weights Hello, As an update I note that I have tried restarting the slurmctld, however that doesn't help. Best regards, David From: slurm-users on behalf of David Baker Sent: 25 July 2019 11:47:35 To: slurm-users@lists.schedmd.com Subject: [slurm-users] Slurm node weights Hello, I'm experimenting with node weights and I'm very puzzled by what I see. Looking at the documentation I gathered that jobs will be allocated to the nodes with the lowest weight which satisfies their requirements. I have 3 nodes in a partition and I have defined the nodes like so.. NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN Weight=50 NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN So, given that the default weight is 1 I would expect jobs to be allocated to orange02 and orange03 first. I find, however that my test job is always allocated to orange01 with the higher weight. Have I overlooked something? I would appreciate your advice, please.
Re: [slurm-users] Slurm node weights
Hi Jeff, Thank you for these details. so far we have never implemented any Slurm fixes. I suspect the node weights feature is quite important and useful, and it's probably worth me investigating this fix. In this respect could you please advise me? If I use the fix to regenerate the "slurm-slurmd" rpm can I then stop the slurmctld processes on the servers, re-install the revised rpm and finally restart the slurmctld processes? Most importantly, can this replacement/fix be done on a live system that is running jobs, etc? That's assuming that we regard/announce the system to be at risk. Or alternatively, do we need to arrange downtime, etc? Best regards, David From: slurm-users on behalf of Sarlo, Jeffrey S Sent: 25 July 2019 13:04 To: Slurm User Community List Subject: Re: [slurm-users] Slurm node weights This is the fix if you want to modify the code and rebuild https://github.com/SchedMD/slurm/commit/f66a2a3e2064<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSchedMD%2Fslurm%2Fcommit%2Ff66a2a3e2064&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7Cc72db5f7dab1400983e008d710f8840c%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&sdata=bhMG78N1%2FQ2ZInn599QuEQ6tyD5pRXAIomlNja1f3j0%3D&reserved=0> I think 18.08.04 and later have it fixed. Jeff ____ From: slurm-users on behalf of David Baker Sent: Thursday, July 25, 2019 6:53 AM To: Slurm User Community List Subject: Re: [slurm-users] Slurm node weights Hello, Thank you for the replies. We're running an early version of Slurm 18.08 and it does appear that the node weights are being ignored re the bug. We're experimenting with Slurm 19*, however we don't expect to deploy that new version for quite a while. In the meantime does anyone know if there any fix or alternative strategy that might help us to achieve the same result? Best regards, David From: slurm-users on behalf of Sarlo, Jeffrey S Sent: 25 July 2019 12:26 To: Slurm User Community List Subject: Re: [slurm-users] Slurm node weights Which version of Slurm are you running? I know some of the earlier versions of 18.08 had a bug and node weights were not working. Jeff ____ From: slurm-users on behalf of David Baker Sent: Thursday, July 25, 2019 6:09 AM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Slurm node weights Hello, As an update I note that I have tried restarting the slurmctld, however that doesn't help. Best regards, David ____ From: slurm-users on behalf of David Baker Sent: 25 July 2019 11:47:35 To: slurm-users@lists.schedmd.com Subject: [slurm-users] Slurm node weights Hello, I'm experimenting with node weights and I'm very puzzled by what I see. Looking at the documentation I gathered that jobs will be allocated to the nodes with the lowest weight which satisfies their requirements. I have 3 nodes in a partition and I have defined the nodes like so.. NodeName=orange01 Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN Weight=50 NodeName=orange[02-03] Procs=48 Sockets=8 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=1018990 State=UNKNOWN So, given that the default weight is 1 I would expect jobs to be allocated to orange02 and orange03 first. I find, however that my test job is always allocated to orange01 with the higher weight. Have I overlooked something? I would appreciate your advice, please.
[slurm-users] Slurm statesave directory -- location and management
Hello, I apologise that this email is a bit vague, however we are keen to understand the role of the Slurm "StateSave" location. I can see the value of the information in this location when, for example, we are upgrading Slurm and the database is temporarily down, however as I note above we are keen to gain a much better understanding of this directory. We have two Slurm controller nodes (one of them is a backup controller), and currently we have put the "StateSave" directory on one of the global GPFS file stores. In other respects Slurm operates independently of the GPFS file stores -- apart from the fact that if GPFS fails jobs will subsequently fail. There was a GPFS failure when I was away from the university. Once GPFS had been restored they attempted to start Slurm, however the StateSave data was out of date. They eventually restarted Slurm, however lost all the queued jobs and the job sequence counter restarted at one. Am I correct in thinking the the information in the StateSave location relates to the state of (a) jobs currently running on the cluster and (b) jobs queued? Am I also correct in thinking that this information is not stored in the slurm database? In other words if you lose the statesave data or it gets corrupted then you will lose all running/queued jobs? Any advice on the management and location of the statesave directory in a dual controller system would be appreciated, please. Best regards, David
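For what it's worth, the relevant slurm.conf pieces look something like the sketch below (hostnames and path invented). The key operational point is that StateSaveLocation holds the controller's record of running and pending jobs, plus the next job ID, none of which can be rebuilt from slurmdbd -- the database only holds accounting history:

# slurm.conf
SlurmctldHost=ctl1
SlurmctldHost=ctl2                 # backup controller
StateSaveLocation=/slurm_state     # must be shared between ctl1 and ctl2, and as reliable as possible
SlurmctldTimeout=120

So yes: if the StateSave contents are lost or rolled back, the queued and running job state goes with them, which matches what was seen after the GPFS incident.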
[slurm-users] oddity with users showing in sacctmgr and sreport
Hello, First issue: I have a couple dozen users that show up for an account but outside of the hierarchical structure in sacctmgr: sacctmgr show assoc account= format=Account,User,Cluster,ParentName%20 Tree When I execute that on a given account, I see that one user resides outside account where the account specifies its Parent Name (i.e. the user is part of the account, just not tucked correctly into the hierarchy): Account User Cluster Par Name -- -- acct rogueuser mycluster acct mycluster acct_root acctnormaluser mycluster Second issue: A couple of rogue users from within sacctmgr show up in sreport output ABOVE the account usage: sreport -T billing cluster AccountUtilizationByUser Accounts= Start=2019-08-01 End=2019-09-11 returns the output, but the first line in the output given is that of a USER and not the specified account as noted in the man sreport section: cluster AccountUtilizationByUser This report will display account utilization as it appears on the hierarchical tree. Starting with the specified account or the root account by default this report will list the underlying usage with a sum on each level. Use the 'tree' option to span the tree for better visibility. NOTE: If there were reservations allowing a whole account any idle time in the reser‐ vation given to the association for the account, not the user associations in the account, so it can be possible a parent account can be larger than the sum of it's children. Has anyone seen either of these behaviors? I've even queried the DB just to see if there wasn't something more obvious as to the issue, but I can't find anything. The associations are very tidy in the DB. When I dump the cluster using sacctmgr I can see the handful of rogue associations all the way at the top of the list, meaning they aren't a part of the root hierarchy in sacctmgr. We're using 18.08.7. Thanks! -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
Re: [slurm-users] Maxjobs not being enforced
Hi, Tina, Could you send the command you ran? David On Tue, Sep 17, 2019 at 2:06 PM Tina Fora wrote: > Hello Slurm user, > > We have 'AccountingStorageEnforce=limits,qos' set in our slurm.conf. I've > added maxjobs=100 for a particular user causing havoc on our shared > storage. This setting is still not being enforced and the user is able to > launch 1000s of jobs. > > I also ran 'scontrol reconfig' and even restarted slurmd on the computes > but no luck. I'm on 17.11. Are there additional steps to limit a user? > > Best, > T > > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
Re: [slurm-users] Maxjobs not being enforced
Hi, Tina, Are you able to confirm whether or not you can view the limit for the user in scontrol as well? David On Tue, Sep 17, 2019 at 4:42 PM Tina Fora wrote: > > # sacctmgr modify user lif6 set maxjobs=100 > > # sacctmgr list assoc user=lif6 format=user,maxjobs,maxsubmit,maxtresmins > User MaxJobs MaxSubmit MaxTRESMins > -- --- - - > lif6 100 > > > > > Hi, Tina, > > > > Could you send the command you ran? > > > > David > > > > On Tue, Sep 17, 2019 at 2:06 PM Tina Fora wrote: > > > >> Hello Slurm user, > >> > >> We have 'AccountingStorageEnforce=limits,qos' set in our slurm.conf. > >> I've > >> added maxjobs=100 for a particular user causing havoc on our shared > >> storage. This setting is still not being enforced and the user is able > >> to > >> launch 1000s of jobs. > >> > >> I also ran 'scontrol reconfig' and even restarted slurmd on the computes > >> but no luck. I'm on 17.11. Are there additional steps to limit a user? > >> > >> Best, > >> T > >> > >> > >> > > > > -- > > David Rhey > > --- > > Advanced Research Computing - Technology Services > > University of Michigan > > > > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
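One more thing worth checking is what the controller itself has loaded, since the slurmdbd record and slurmctld's in-memory copy can disagree (for instance if AccountingStorageEnforce was changed without restarting slurmctld). On reasonably recent releases, something like:

# limits as slurmctld currently sees them for this user
scontrol show assoc_mgr users=lif6 flags=assoc

# confirm enforcement is actually active in the running configuration
scontrol show config | grep -i AccountingStorageEnforce

If the association shown there carries no MaxJobs value, restarting slurmctld (rather than just scontrol reconfig) is the next thing to try.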
[slurm-users] Advice on setting a partition QOS
Hello, I have defined a partition and corresponding QOS in Slurm. This is the serial queue to which we route jobs that require up to (and including) 20 cpus. The nodes controlled by serial are shared. I've set the QOS like so.. [djb1@cyan53 slurm]$ sacctmgr show qos serial format=name,maxtresperuser Name MaxTRESPU -- - serial cpu=120 The max cpus/user is set high to try to ensure (as often as possible) that the nodes are all busy and not in mixed states. Obviously this cannot be the case all the time -- depending upon memory requirements, etc. I noticed that a number of jobs were pending with the reason QOSMaxNodePerUserLimit. I've tried firing test jobs to the queue myself and noticed that I can never have more than 32 jobs running (each requesting 1 cpu) and the rest are pending as per the reason above. Since the QOS cpu/user limit is set to 120 I would expect to be able to run more jobs -- given that some serial nodes are still not fully occupied. Furthermore, I note that other users appear not to be able to use more then 32 cpus in the queue. The 32 limit does make a degree of sense. The "normal" QOS is set to cpus/user=1280, nodes/user=32. It's almost like the 32 cpus in the serial queue are being counted as nodes -- as per the pending reason. Could someone please help me understand this issue and how to avoid it? Best regards, David
Re: [slurm-users] Advice on setting a partition QOS
Dear Jurgen, Thank you for your reply. So, in respond to your suggestion I submitted a batch of jobs each asking for 2 cpus. Again I was able to get 32 jobs running at once. I presume this is a weird interaction with the normal QOS. In that respect would it be best to redefine the normal OQS simply in terms of cpu/user usage? That is, not cpus/user and nodes/user. Best regards, David From: slurm-users on behalf of Juergen Salk Sent: 25 September 2019 14:52 To: Slurm User Community List Subject: Re: [slurm-users] Advice on setting a partition QOS Dear David, as it seems, Slurm counts allocated nodes on a per job basis, i.e. every individual one-core jobs counts as an additional node even if they all run on one and the same node. Can you allocate 64 CPUs at the same time when requesting 2 CPUs per job? We've also had this (somewhat strange) behaviour with Moab and therefore implemented limits based on processor counts rather than node counts per user. This is obviously no issue for exclusive node scheduling, but for non-exclusive nodes it is (or at least may be). Best regards Jürgen -- Jürgen Salk Scientific Software & Compute Services (SSCS) Kommunikations- und Informationszentrum (kiz) Universität Ulm Telefon: +49 (0)731 50-22478 Telefax: +49 (0)731 50-22471 * David Baker [190925 12:12]: > Hello, > > I have defined a partition and corresponding QOS in Slurm. This is > the serial queue to which we route jobs that require up to (and > including) 20 cpus. The nodes controlled by serial are shared. I've > set the QOS like so.. > > [djb1@cyan53 slurm]$ sacctmgr show qos serial format=name,maxtresperuser > Name MaxTRESPU > -- - > serial cpu=120 > > The max cpus/user is set high to try to ensure (as often as > possible) that the nodes are all busy and not in mixed states. > Obviously this cannot be the case all the time -- depending upon > memory requirements, etc. > > I noticed that a number of jobs were pending with the reason > QOSMaxNodePerUserLimit. I've tried firing test jobs to the queue > myself and noticed that I can never have more than 32 jobs running > (each requesting 1 cpu) and the rest are pending as per the reason > above. Since the QOS cpu/user limit is set to 120 I would expect to > be able to run more jobs -- given that some serial nodes are still > not fully occupied. Furthermore, I note that other users appear not > to be able to use more then 32 cpus in the queue. > > The 32 limit does make a degree of sense. The "normal" QOS is set to > cpus/user=1280, nodes/user=32. It's almost like the 32 cpus in the > serial queue are being counted as nodes -- as per the pending > reason. > > Could someone please help me understand this issue and how to avoid it? > > Best regards, > David
[slurm-users] How to modify the normal QOS
Hello, Currently my normal QOS specifies MaxTRESPU=cpu=1280,nodes=32. I've tried a number of edits, however I haven't yet found a way of redefining the MaxTRESPU to be "cpu=1280". In the past I have resorted to deleting a QOS completely and redefining the whole thing, but in this case I'm not sure if I can delete the normal QOS on a running cluster. I have tried commands like the following to no avail.. sacctmgr update qos normal set maxtresperuser=cpu=1280 Could anyone please help with this. Best regards, David
Re: [slurm-users] How to modify the normal QOS
Dear Jurgen, Thank you for that. That does the expected job. It looks like the weirdness that I saw in the serial partition has now gone away and so that is good. Best regards, David From: slurm-users on behalf of Juergen Salk Sent: 26 September 2019 16:18 To: Slurm User Community List Subject: Re: [slurm-users] How to modify the normal QOS * David Baker [190926 14:12]: > > Currently my normal QOS specifies MaxTRESPU=cpu=1280,nodes=32. I've > tried a number of edits, however I haven't yet found a way of > redefining the MaxTRESPU to be "cpu=1280". In the past I have > resorted to deleting a QOS completely and redefining the whole > thing, but in this case I'm not sure if I can delete the normal QOS > on a running cluster. > > I have tried commands like the following to no avail.. > > sacctmgr update qos normal set maxtresperuser=cpu=1280 > > Could anyone please help with this. Dear David, does this work for you? sacctmgr update qos normal set MaxTRESPerUser=node=-1 Best regards Jürgen -- Jürgen Salk Scientific Software & Compute Services (SSCS) Kommunikations- und Informationszentrum (kiz) Universität Ulm Telefon: +49 (0)731 50-22478 Telefax: +49 (0)731 50-22471
[slurm-users] Slurm very rarely assigned an estimated start time to a job
Hello., I have just started to take a look at Slurm v19* with a view to an upgrade (most likely in the next year). My reaction is that Slurm very rarely provides an estimated start time for a job. I understand that this is not possible for jobs on hold and dependent jobs. On the other hand I've just submitted a set of simple test jobs to our development cluster, and I see that none of the queued jobs have an estimated start time. These jobs haven't been queuing for too long (30 minutes or so). That is... [root@headnode-1 slurm]# squeue --start JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON) 12 batchmyjob hpcdev1 PD N/A 1 (null) (Resources) 13 batchmyjob hpcdev1 PD N/A 1 (null) (Resources) 14 batchmyjob hpcdev1 PD N/A 1 (null) (Resources) 15 batchmyjob hpcdev1 PD N/A 1 (null) (Resources) etc Is this what others see or are there any recommended configurations/tips/tricks to make sure that slurm provides estimates? Any advice would be appreciated, please. Best regards, David
Re: [slurm-users] Does Slurm store "time in current state" values anywhere ?
Hi, What about scontrol show job to see various things like: SubmitTime, EligibleTime, AccrueTime etc? David On Thu, Oct 3, 2019 at 4:53 AM Kevin Buckley wrote: > Hi there, > > we're hoping to overcome an issue where some of our users are keen > on writing their own meta-schedulers, so as to try and beat the > actual scheduler, but can't seemingly do as good a job as a scheduler > that's been developed by people who understand scheduling (no real > surprises there!), and so occasionally generate false perceptions of > our systems. > > One of the things our meta-scheduler writers seem unable to allow for, > is jobs that remain in a "completing" state, for whatever reason. > > Whilst we're not looking to provide succour to meta-scheduler writers, > we can see a need for some way to present and/or make use of, a > >"job has been in state S for time T" > > or > >"job entered current state at time T" > > info. > > > Can we access such a value from Slurm: rather, does Slurm keep track > of such a value, whether or not it can currently be accessed on the > "user-side" ? > > > What we're trying to avoid is the need to write a not-quite-Slurm > database that stores such info by continually polling our actual > Slurm database, because we don't think of ourselves as meta-scheduler > writers. > > Here's hoping, > Kevin > > -- > Supercomputing Systems Administrator > Pawsey Supercomputing Centre > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
Re: [slurm-users] Slurm very rarely assigned an estimated start time to a job
We've been working to tune our backfill scheduler here. Here is a presentation some of you might have seen at a previous SLUG on tuning the backfill scheduler. HTH! https://slurm.schedmd.com/SUG14/sched_tutorial.pdf David On Wed, Oct 2, 2019 at 1:37 PM Mark Hahn wrote: > >(most likely in the next year). My reaction is that Slurm very rarely > >provides an estimated start time for a job. I understand that this is not > >possible for jobs on hold and dependent jobs. > > it's also not possible if both running and queued jobs > lack definite termination times; do yours? > > my understanding is the following: > the main scheduler does not perform forward planning. > that is, it is opportunistic. it walks the list of priority-sorted > pending jobs, starting any which can run on currently free > (or preemptable) resources. > > the backfill scheduler is a secondary, asynchronous loop that tries hard > not to interfere with the main scheduler (severely throttles itself) > and tries to place start times for pending jobs. > > the main issue with forward scheduling is that if high-prio jobs become > runnable (submitted, off hold, dependency-satisfied), then most of the > (tentative) start times probably need to be removed. > > a quick look at plugins/sched/backfill/backfill.c indicates that things > are /complicated/ ;) > > we (ComputeCanada) don't see a lot of forward start times either. > > I also would welcome discussion of how to tune the backfill scheduler! > I suspect that in order to work well, it needs a particular distribution > of job priorities. > > regards, mark hahn. > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
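To make the tuning angle concrete: squeue --start can only report times that the backfill scheduler has actually planned, so the backfill parameters are the ones to look at first. A sketch, with illustrative values only:

# slurm.conf -- bf_window is in minutes and generally needs to cover the longest
# permitted walltime, otherwise long jobs never receive a planned start time;
# bf_resolution (seconds) can be coarsened so the planner can afford to look further ahead.
SchedulerParameters=bf_continue,bf_window=7200,bf_resolution=300,bf_max_job_test=1000,bf_interval=60

As noted above, newly arriving higher-priority work invalidates much of the plan, so even with generous settings only a subset of pending jobs will carry a start estimate at any moment.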
[slurm-users] Running job using our serial queue
Hello, We decided to route all jobs requesting from 1 to 20 cores to our serial queue. Furthermore, the nodes controlled by the serial queue are shared by multiple users. We did this to try to reduce the level of fragmentation across the cluster -- our default "batch" queue provides exclusive access to compute nodes. It looks like the downside of the serial queue is that jobs from different users can interact quite badly. To some extent this is an education issue -- for example matlab users need to be told to add the "-singleCompThread" option to their command line. On the other hand I wonder if our cgroups setup is optimal for the serial queue. Our cgroup.conf contains...
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup
The relevant cgroup configuration in the slurm.conf is...
ProctrackType=proctrack/cgroup
TaskPlugin=affinity,cgroup
Could someone please advise us on the required/recommended cgroup setup for the above scenario? For example, should we really set "TaskAffinity=yes"? I assume the interaction between jobs (sometimes jobs can get stalled) is due to context switching at the kernel level, however (apart from educating users) how can we minimise that switching on the serial nodes? Best regards, David
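One way to sanity-check what a given job on a shared node is actually confined to (the paths below assume the cgroup v1 hierarchy used by Slurm 18.08, and the uid/jobid values are placeholders taken from a slurmd.log):

# which CPUs the job's cgroup is pinned to
cat /sys/fs/cgroup/cpuset/slurm/uid_57337/job_164977/cpuset.cpus

# the memory limit imposed on the job
cat /sys/fs/cgroup/memory/slurm/uid_57337/job_164977/memory.limit_in_bytes

If a "1-core" matlab job shows a single CPU in cpuset.cpus but the node still feels sluggish, the contention is more likely to be memory pressure, swap or I/O than CPU stealing.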
Re: [slurm-users] Running job using our serial queue
Hello, Thank you for your replies. I double checked that the "task" in, for example, taskplugin=task/affinity is optional. In this respect it is good to know that we have the correct cgroups setup. So in theory users should only disturb themselves, however in reality we find that there is often a knock on effect on other users' jobs. So, for example, users have complained that their jobs sometimes stall. I can only vaguely think that something odd is going on at the kernel level perhaps. One additional thing that I need to ask is... Should we have hwloc installed our compute nodes? Does that help? Whenever I check which processes are not being constrained by cgroups I only ever find a small group of system processes. Best regards, David From: slurm-users on behalf of Marcus Wagner Sent: 05 November 2019 07:47 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Running job using our serial queue Hi David, doing it the way you do it, is the same way, we do it. When the Matlab job asks for one CPU, it only gets on CPU this way. That means, that all the processes are bound to this one CPU. So (theoretically) the user is just disturbing himself, if he uses more. But especially Matlab, there are more things to do. I t does not suffice to add '-singleCompThread' to the commandline. Matlab is not the only tool, that tries to use all cores, it finds on the node. The same is valid for CPLEX and Gurobi, both often used from Matlab. So even, if the user sets '-singleCompThread' for Matlab, that does not mean at all, the job is only using one CPU. Best Marcus On 11/4/19 4:14 PM, David Baker wrote: Hello, We decided to route all jobs requesting from 1 to 20 cores to our serial queue. Furthermore, the nodes controlled by the serial queue are shared by multiple users. We did this to try to reduce the level of fragmentation across the cluster -- our default "batch" queue provides exclusive access to compute nodes. It looks like the downside of the serial queue is that jobs from different users can interact quite badly. To some extent this is an education issue -- for example matlab users need to be told to add the "-singleCompThread" option to their command line. On the other hand I wonder if our cgroups setup is optimal for the serial queue. Our cgroup.conf contains... CgroupAutomount=yes CgroupReleaseAgentDir="/etc/slurm/cgroup" ConstrainCores=yes ConstrainRAMSpace=yes ConstrainDevices=yes TaskAffinity=no CgroupMountpoint=/sys/fs/cgroup The relevant cgroup configuration in the slurm.conf is... ProctrackType=proctrack/cgroup TaskPlugin=affinity,cgroup Could someone please advise us on the required/recommended cgroup setup for the above scenario? For example, should we really set "TaskAffinity=yes"? I assume the interaction between jobs (sometimes jobs can get stalled) is due to context switching at the kernel level, however (apart from educating users) how can we minimise that switching on the serial nodes? Best regards, David -- Marcus Wagner, Dipl.-Inf. IT Center Abteilung: Systeme und Betrieb RWTH Aachen University Seffenter Weg 23 52074 Aachen Tel: +49 241 80-24383 Fax: +49 241 80-624383 wag...@itc.rwth-aachen.de<mailto:wag...@itc.rwth-aachen.de> www.itc.rwth-aachen.de<https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.itc.rwth-aachen.de&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7Cf4fb53d4fef74523599b08d761c4ac18%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&sdata=dtF928nvXUbXjpc4COy5bB9Qrs9LoZE8ePa26Ydjdsc%3D&reserved=0>
Re: [slurm-users] Running job using our serial queue
Hi Marcus, Thank you for your reply. Your comments regarding the oom_killer sounds interesting. Looking at the slurmd logs on the serial nodes I see that the oom_killer is very active on a typical day, and so I suspect you're likely on to something there. As you might expect memory is configured as a resource on these shared nodes and users should take care to request sufficient memory for their job. More often than none I guess that users are wrongly assuming that the default memory allocation is sufficient. Best regards, David From: Marcus Wagner Sent: 06 November 2019 09:53 To: David Baker ; slurm-users@lists.schedmd.com ; juergen.s...@uni-ulm.de Subject: Re: [slurm-users] Running job using our serial queue Hi David, if I remember right (we have disabled swap for years now), swapping out processes seem to slow down the system overall. But I know, that if the oom_killer does its job (killing over memory processes), the whole system is stalled until it has done its work. This might be the issue, your users see. Hwloc at least should help the scheduler to decide, where to place processes, but if I remember right, slurm has to be built with hwloc support (meaning at least hwloc-devel has to be installed). But this part is more guessing, than knowing. Best Marcus On 11/5/19 11:58 AM, David Baker wrote: Hello, Thank you for your replies. I double checked that the "task" in, for example, taskplugin=task/affinity is optional. In this respect it is good to know that we have the correct cgroups setup. So in theory users should only disturb themselves, however in reality we find that there is often a knock on effect on other users' jobs. So, for example, users have complained that their jobs sometimes stall. I can only vaguely think that something odd is going on at the kernel level perhaps. One additional thing that I need to ask is... Should we have hwloc installed our compute nodes? Does that help? Whenever I check which processes are not being constrained by cgroups I only ever find a small group of system processes. Best regards, David From: slurm-users <mailto:slurm-users-boun...@lists.schedmd.com> on behalf of Marcus Wagner <mailto:wag...@itc.rwth-aachen.de> Sent: 05 November 2019 07:47 To: slurm-users@lists.schedmd.com<mailto:slurm-users@lists.schedmd.com> <mailto:slurm-users@lists.schedmd.com> Subject: Re: [slurm-users] Running job using our serial queue Hi David, doing it the way you do it, is the same way, we do it. When the Matlab job asks for one CPU, it only gets on CPU this way. That means, that all the processes are bound to this one CPU. So (theoretically) the user is just disturbing himself, if he uses more. But especially Matlab, there are more things to do. I t does not suffice to add '-singleCompThread' to the commandline. Matlab is not the only tool, that tries to use all cores, it finds on the node. The same is valid for CPLEX and Gurobi, both often used from Matlab. So even, if the user sets '-singleCompThread' for Matlab, that does not mean at all, the job is only using one CPU. Best Marcus On 11/4/19 4:14 PM, David Baker wrote: Hello, We decided to route all jobs requesting from 1 to 20 cores to our serial queue. Furthermore, the nodes controlled by the serial queue are shared by multiple users. We did this to try to reduce the level of fragmentation across the cluster -- our default "batch" queue provides exclusive access to compute nodes. It looks like the downside of the serial queue is that jobs from different users can interact quite badly. 
To some extent this is an education issue -- for example matlab users need to be told to add the "-singleCompThread" option to their command line. On the other hand I wonder if our cgroups setup is optimal for the serial queue. Our cgroup.conf contains... CgroupAutomount=yes CgroupReleaseAgentDir="/etc/slurm/cgroup" ConstrainCores=yes ConstrainRAMSpace=yes ConstrainDevices=yes TaskAffinity=no CgroupMountpoint=/sys/fs/cgroup The relevant cgroup configuration in the slurm.conf is... ProctrackType=proctrack/cgroup TaskPlugin=affinity,cgroup Could someone please advise us on the required/recommended cgroup setup for the above scenario? For example, should we really set "TaskAffinity=yes"? I assume the interaction between jobs (sometimes jobs can get stalled) is due to context switching at the kernel level, however (apart from educating users) how can we minimise that switching on the serial nodes? Best regards, David -- Marcus Wagner, Dipl.-Inf. IT Center Abteilung: Systeme und Betrieb RWTH Aachen University Seffenter Weg 23 52074 Aachen Tel: +49 241 80-24383 Fax: +49 241 80-624383 wag...@itc.rwth-aachen.de<mailto:wag...@itc.rwth-aachen.de> www.itc.rwth-aachen.de<https://eur03
[slurm-users] oom-kill events for no good reason
Hello, We are dealing with some weird issue on our shared nodes where job appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact when I started to take a closer look this afternoon I noticed that all jobs on all nodes (not just the shared nodes) are "firing" the oom-killer for some reason when they finish. As a demo I launched a very simple (low memory usage) test jobs on a shared node and then after a few minutes cancelled it to show the behaviour. Looking in the slurmd.log -- see below -- we see the oom-killer being fired for no good reason. This "feels" vaguely similar to this bug -- https://bugs.schedmd.com/show_bug.cgi?id=5121 which I understand was patched back in SLURM v17 (we are using v18*). Has anyone else seen this behaviour? Or more to the point does anyone understand this behaviour and know how to squash it, please? Best regards, David [2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337 [2019-11-07T16:14:52.559] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited [2019-11-07T16:14:52.560] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited [2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited [2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited [2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using sched_affinity for tasks [2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using sched_affinity for tasks [2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 CANCELLED AT 2019-11-07T16:16:05 *** [2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event count: 1 [2019-11-07T16:16:05.886] [164977.extern] done with job
Re: [slurm-users] oom-kill events for no good reason
Hello, Thank you all for your useful replies. I double checked that the oom-killer "fires" at the end of every job on our cluster. As you mention this isn't significant and not something to be concerned about. Best regards, David From: slurm-users on behalf of Marcus Wagner Sent: 08 November 2019 13:00 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] oom-kill events for no good reason Hi David, yes, I see these messages also. I also think, this is more likely a wrong message. If a job has been cancelled by the OOM-Killer, you can see this with sacct, e.g. $> sacct -j 10816098 JobIDJobName PartitionAccount AllocCPUS State ExitCode -- -- -- -- -- 10816098 VASP_MPI c18mdefault 12 OUT_OF_ME+0:125 10816098.ba+ batch default 12 OUT_OF_ME+0:125 10816098.ex+ extern default 12 COMPLETED 0:0 10816098.0 vasp_mpi default 12 OUT_OF_ME+0:125 Best Marcus On 11/7/19 5:36 PM, David Baker wrote: Hello, We are dealing with some weird issue on our shared nodes where job appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact when I started to take a closer look this afternoon I noticed that all jobs on all nodes (not just the shared nodes) are "firing" the oom-killer for some reason when they finish. As a demo I launched a very simple (low memory usage) test jobs on a shared node and then after a few minutes cancelled it to show the behaviour. Looking in the slurmd.log -- see below -- we see the oom-killer being fired for no good reason. This "feels" vaguely similar to this bug -- https://bugs.schedmd.com/show_bug.cgi?id=5121<https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.schedmd.com%2Fshow_bug.cgi%3Fid%3D5121&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7Cb280bfbe58bb495bbace08d7644c9e52%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&sdata=g%2BT6zIZqTr8ZAi52RgFRaMViwdxZPjkEOkvNa6YEXRU%3D&reserved=0> which I understand was patched back in SLURM v17 (we are using v18*). Has anyone else seen this behaviour? Or more to the point does anyone understand this behaviour and know how to squash it, please? Best regards, David [2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337 [2019-11-07T16:14:52.559] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited [2019-11-07T16:14:52.560] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited [2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited [2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited [2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using sched_affinity for tasks [2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using sched_affinity for tasks [2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 CANCELLED AT 2019-11-07T16:16:05 *** [2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event count: 1 [2019-11-07T16:16:05.886] [164977.extern] done with job -- Marcus Wagner, Dipl.-Inf. 
IT Center Abteilung: Systeme und Betrieb RWTH Aachen University Seffenter Weg 23 52074 Aachen Tel: +49 241 80-24383 Fax: +49 241 80-624383 wag...@itc.rwth-aachen.de<mailto:wag...@itc.rwth-aachen.de> www.itc.rwth-aachen.de
[slurm-users] Longer queuing times for larger jobs
Hello, Our SLURM cluster is relatively small. We have 350 standard compute nodes, each with 40 cores. The largest job that users can run on the partition is one requesting 32 nodes. Our cluster is a general university research resource and so there are many different sizes of jobs, ranging from single-core jobs that get routed to a serial partition via the job_submit.lua, through to jobs requesting 32 nodes. When we first started the service, 32 node jobs were typically taking in the region of 2 days to schedule -- recently queuing times have started to get out of hand. Our setup is essentially...
PriorityFavorSmall=NO
FairShareDampeningFactor=5
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityWeightAge=40
PriorityWeightPartition=1000
PriorityWeightJobSize=50
PriorityWeightQOS=100
PriorityMaxAge=7-0
To try to reduce the queuing times for our bigger jobs, should we potentially increase the PriorityWeightJobSize factor in the first instance to bump up the priority of such jobs? Or should we define a set of QOSs which we assign to jobs in our job_submit.lua depending on the size of the job? In other words, let's say there is a "large" QOS that gives the largest jobs a higher priority and also limits how many of those jobs a single user can submit. Your advice would be appreciated, please. At the moment these large jobs are not accruing a sufficiently high priority to rise above the other jobs in the cluster. Best regards, David
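Since job_submit.lua keeps coming up in this context, a minimal sketch of tagging big jobs with a dedicated QOS might look like the fragment below. The QOS name "large" and the 16-node threshold are invented, and the QOS itself would still need to be created and given a priority and per-user limits with sacctmgr:

-- job_submit.lua (fragment)
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- min_nodes is NO_VAL when the user did not ask for a node count explicitly
    if job_desc.min_nodes ~= nil and job_desc.min_nodes ~= slurm.NO_VAL and job_desc.min_nodes >= 16 then
        job_desc.qos = "large"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

Whether the large QOS then gets its boost from PriorityWeightQOS (via the QOS priority value) or mainly acts through extra limits is a policy choice; the lua side stays this small either way.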
Re: [slurm-users] Longer queuing times for larger jobs
Hello, Thank you for your reply. in answer to Mike's questions... Our serial partition nodes are partially shared by the high memory partition. That is, the partitions overlap partially -- shared nodes move one way or another depending upon demand. Jobs requesting up to and including 20 cores are routed to the serial queue. The serial nodes are shared resources. In other words, jobs from different users can share the nodes. The maximum time for serial jobs is 60 hours. Overtime there hasn't been any particular change in the time that users are requesting. Likewise I'm convinced that the overall job size spread is the same over time. What has changed is the increase in the number of smaller jobs. That is, one node jobs that are exclusive (can't be routed to the serial queue) or that require more then 20 cores, and also jobs requesting up to 10/15 nodes (let's say). The user base has increased dramatically over the last 6 months or so. This over population is leading to the delay in scheduling the larger jobs. Given the size of the cluster we may need to make decisions regarding which types of jobs we allow to "dominate" the system. The larger jobs at the expense of the small fry for example, however that is a difficult decision that means that someone has got to wait longer for results.. Best regards, David From: slurm-users on behalf of Renfro, Michael Sent: 31 January 2020 13:27 To: Slurm User Community List Subject: Re: [slurm-users] Longer queuing times for larger jobs Greetings, fellow general university resource administrator. Couple things come to mind from my experience: 1) does your serial partition share nodes with the other non-serial partitions? 2) what’s your maximum job time allowed, for serial (if the previous answer was “yes”) and non-serial partitions? Are your users submitting particularly longer jobs compared to earlier? 3) are you using the backfill scheduler at all? -- Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services 931 372-3601 / Tennessee Tech University On Jan 31, 2020, at 6:23 AM, David Baker wrote: Hello, Our SLURM cluster is relatively small. We have 350 standard compute nodes each with 40 cores. The largest job that users can run on the partition is one requesting 32 nodes. Our cluster is a general university research resource and so there are many different sizes of jobs ranging from single core jobs, that get routed to a serial partition via the job-submit.lua, through to jobs requesting 32 nodes. When we first started the service, 32 node jobs were typically taking in the region of 2 days to schedule -- recently queuing times have started to get out of hand. Our setup is essentially... PriorityFavorSmall=NO FairShareDampeningFactor=5 PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE PriorityType=priority/multifactor PriorityDecayHalfLife=7-0 PriorityWeightAge=40 PriorityWeightPartition=1000 PriorityWeightJobSize=50 PriorityWeightQOS=100 PriorityMaxAge=7-0 To try to reduce the queuing times for our bigger jobs should we potentially increase the PriorityWeightJobSize factor in the first instance to bump up the priority of such jobs? Or should we potentially define a set of QOSs which we assign to jobs in our job_submit.lua depending on the size of the job. In other words, let's say there is large QOS that give the largest jobs a higher priority, and also limits how many of those jobs that a single user can submit? Your advice would be appreciated, please. 
At the moment these large jobs are not accruing a sufficiently high priority to rise above the other jobs in the cluster. Best regards, David
Re: [slurm-users] Longer queuing times for larger jobs
Hello, Thank you for your detailed reply. That’s all very useful. I manage to mistype our cluster size since there are actually 450 standard compute, 40 core, compute nodes. What you say is interesting and so it concerns me that things are so bad at the moment, I wondered if you could please give me some more details of how you use TRES to throttle user activity. We have applied some limits to throttle users, however perhaps not enough or not well enough. So the details of what you do would be really appreciated, please. In addition, we do use backfill, however we rarely see nodes being freed up in the cluster to make way for high priority work which again concerns me. If you could please share your backfill configuration then that would be appreciated, please. Finally, which version of Slurm are you running? We are using an early release of v18. Best regards, David From: slurm-users on behalf of Renfro, Michael Sent: 31 January 2020 17:23:05 To: Slurm User Community List Subject: Re: [slurm-users] Longer queuing times for larger jobs I missed reading what size your cluster was at first, but found it on a second read. Our cluster and typical maximum job size scales about the same way, though (our users’ typical job size is anywhere from a few cores up to 10% of our core count). There are several recommendations to separate your priority weights by an order of magnitude or so. Our weights are dominated by fairshare, and we effectively ignore all other factors. We also put TRES limits on by default, so that users can’t queue-stuff beyond a certain limit (any jobs totaling under around 1 cluster-day can be in a running or queued state, and anything past that is ignored until their running jobs burn off some of their time). This allows other users’ jobs to have a chance to run if resources are available, even if they were submitted well after the heavy users’ blocked jobs. We also make extensive use of the backfill scheduler to run small, short jobs earlier than their queue time might allow, if and only if they don’t delay other jobs. If a particularly large job is about to run, we can see the nodes gradually empty out, which opens up lots of capacity for very short jobs. Overall, our average wait times since September 2017 haven’t exceeded 90 hours for any job size, and I’m pretty sure a *lot* of that wait is due to a few heavy users submitting large numbers of jobs far beyond the TRES limit. Even our jobs of 5-10% cluster size have average start times of 60 hours or less (and we've managed under 48 hours for those size jobs for all but 2 months of that period), but those larger jobs tend to be run by our lighter users, and they get a major improvement to their queue time due to being far below their fairshare target. We’ve been running at >50% capacity since May 2018, and >60% capacity since December 2018, and >80% capacity since February 2019. So our wait times aren’t due to having a ton of spare capacity for extended periods of time. Not sure how much of that will help immediately, but it may give you some ideas. > On Jan 31, 2020, at 10:14 AM, David Baker wrote: > > External Email Warning > This email originated from outside the university. Please use caution when > opening attachments, clicking links, or responding to requests. > Hello, > > Thank you for your reply. in answer to Mike's questions... > > Our serial partition nodes are partially shared by the high memory partition. > That is, the partitions overlap partially -- shared nodes move one way or > another depending upon demand. 
Jobs requesting up to and including 20 cores > are routed to the serial queue. The serial nodes are shared resources. In > other words, jobs from different users can share the nodes. The maximum time > for serial jobs is 60 hours. > > Overtime there hasn't been any particular change in the time that users are > requesting. Likewise I'm convinced that the overall job size spread is the > same over time. What has changed is the increase in the number of smaller > jobs. That is, one node jobs that are exclusive (can't be routed to the > serial queue) or that require more then 20 cores, and also jobs requesting up > to 10/15 nodes (let's say). The user base has increased dramatically over the > last 6 months or so. > > This over population is leading to the delay in scheduling the larger jobs. > Given the size of the cluster we may need to make decisions regarding which > types of jobs we allow to "dominate" the system. The larger jobs at the > expense of the small fry for example, however that is a difficult decision > that means that someone has got to wait longer for results.. > > Best regards, > David > From: slurm-users on behalf of > Renfro, Michael > Sent: 31 Janu
Re: [slurm-users] Longer queuing times for larger jobs
Hello, Thank you very much again for your comments and the details of your slurm configuration. All the information is really useful. We are working on our cluster right now and making some appropriate changes. We'll see how we get on over the next 24 hours or so. Best regards, David From: slurm-users on behalf of Renfro, Michael Sent: 31 January 2020 22:08 To: Slurm User Community List Subject: Re: [slurm-users] Longer queuing times for larger jobs Slurm 19.05 now, though all these settings were in effect on 17.02 until quite recently. If I get some detail wrong below, I hope someone will correct me. But this is our current working state. We’ve been able to schedule 10-20k jobs per month since late 2017, and we successfully scheduled 320k jobs over December and January (largely due to one user using some form of automated submission for very short jobs). Basic scheduler setup: As I’d said previously, we prioritize on fairshare almost exclusively. Most of our jobs (molecular dynamics, CFD) end up in a single batch partition, since GPU and big-memory jobs have other partitions. SelectType=select/cons_res SelectTypeParameters=CR_Core_Memory PriorityType=priority/multifactor PriorityDecayHalfLife=14-0 PriorityWeightFairshare=10 PriorityWeightAge=1000 PriorityWeightPartition=1 PriorityWeightJobSize=1000 PriorityMaxAge=1-0 TRES limits: We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set grptresrunmin=cpu=144 — there might be a way of doing this at a higher accounting level, but it works as is. We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and set MaxJobsPerUser equal to our total GPU count. That helps prevent users from queue-stuffing the GPUs even if they stay well below the 1000 CPU-day TRES limit above. Backfill: SchedulerType=sched/backfill SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200 Can’t remember where I found the backfill guidance, but: - bf_window is set to our maximum job length (30 days) and bf_resolution is set to 1.5 days. Most of our users’ jobs are well over 1 day. - We have had users who didn’t use job arrays, and submitted a ton of small jobs at once, thus bf_max_job_user gives the scheduler a chance to start up to 80 jobs per user each cycle. This also prompted us to increase default_queue_depth, so the backfill scheduler would examine more jobs each cycle. - bf_continue should let the backfill scheduler continue where it left off if it gets interrupted, instead of having to start from scratch each time. I can guarantee you that our backfilling was sub-par until we tuned these parameters (or at least a few users could find a way to submit so many jobs that the backfill couldn’t keep up, even when we had idle resources for their very short jobs). > On Jan 31, 2020, at 3:01 PM, David Baker wrote: > > External Email Warning > This email originated from outside the university. Please use caution when > opening attachments, clicking links, or responding to requests. > Hello, > > Thank you for your detailed reply. That’s all very useful. I manage to > mistype our cluster size since there are actually 450 standard compute, 40 > core, compute nodes. What you say is interesting and so it concerns me that > things are so bad at the moment, > > I wondered if you could please give me some more details of how you use TRES > to throttle user activity. We have applied some limits to throttle users, > however perhaps not enough or not well enough. 
So the details of what you do > would be really appreciated, please. > > In addition, we do use backfill, however we rarely see nodes being freed up > in the cluster to make way for high priority work which again concerns me. If > you could please share your backfill configuration then that would be > appreciated, please. > > Finally, which version of Slurm are you running? We are using an early > release of v18. > > Best regards, > David > > From: slurm-users on behalf of > Renfro, Michael > Sent: 31 January 2020 17:23:05 > To: Slurm User Community List > Subject: Re: [slurm-users] Longer queuing times for larger jobs > > I missed reading what size your cluster was at first, but found it on a > second read. Our cluster and typical maximum job size scales about the same > way, though (our users’ typical job size is anywhere from a few cores up to > 10% of our core count). > > There are several recommendations to separate your priority weights by an > order of magnitude or so. Our weights are dominated by fairshare, and we > effectively ignore all other factors. > > We also put TRES limits on by default, so that users can’t queue-stuff beyond > a certain limit (any jobs totaling under around 1 cluster-
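For reference, GrpTRESRunMin is counted in TRES-minutes, so a 1000 CPU-day cap works out as 1000 x 24 x 60 = 1,440,000 CPU-minutes; the value in the quoted command looks truncated and was presumably along the lines of:

sacctmgr modify user someuser set GrpTRESRunMin=cpu=1440000

Setting the same limit on an account instead (sacctmgr modify account ... set GrpTRESRunMin=cpu=...) caps the account's aggregate running CPU-minutes rather than each individual user's.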
Re: [slurm-users] Longer queuing times for larger jobs
Hello, I've taken a very good look at our cluster, however as yet not made any significant changes. The one change that I did make was to increase the "jobsizeweight". That's now our dominant parameter and it does ensure that our largest jobs (> 20 nodes) are making it to the top of the sprio listing, which is what we want to see. These large jobs aren't making any progress despite the priority lift. I additionally decreased the nice value of the job that sparked this discussion. That is (looking at sprio) there is a 32 node job with a very high priority...

JOBID  PARTITION USER    PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE
280919 batch     mep1c10 1275481  40  59827     415655  0         0   -40

That job has been sitting in the queue for well over a week and it is disconcerting that we never see nodes becoming idle in order to service these large jobs. Nodes do become idle and then get scooped by jobs started by backfill. Looking at the slurmctld logs I see that the vast majority of jobs are being started via backfill -- including, for example, a 24 node job. I see very few jobs allocated by the scheduler. That is, messages like sched: Allocate JobId=296915 are few and far between and I never see any of the large jobs being allocated in the batch queue. Surely, this is not correct, however does anyone have any advice on what to check, please? Best regards, David From: slurm-users on behalf of Killian Murphy Sent: 04 February 2020 10:48 To: Slurm User Community List Subject: Re: [slurm-users] Longer queuing times for larger jobs Hi David. I'd love to hear back about the changes that you make and how they affect the performance of your scheduler. Any chance you could let us know how things go? Killian On Tue, 4 Feb 2020 at 10:43, David Baker mailto:d.j.ba...@soton.ac.uk>> wrote: Hello, Thank you very much again for your comments and the details of your slurm configuration. All the information is really useful. We are working on our cluster right now and making some appropriate changes. We'll see how we get on over the next 24 hours or so. Best regards, David From: slurm-users mailto:slurm-users-boun...@lists.schedmd.com>> on behalf of Renfro, Michael mailto:ren...@tntech.edu>> Sent: 31 January 2020 22:08 To: Slurm User Community List mailto:slurm-users@lists.schedmd.com>> Subject: Re: [slurm-users] Longer queuing times for larger jobs Slurm 19.05 now, though all these settings were in effect on 17.02 until quite recently. If I get some detail wrong below, I hope someone will correct me. But this is our current working state. We’ve been able to schedule 10-20k jobs per month since late 2017, and we successfully scheduled 320k jobs over December and January (largely due to one user using some form of automated submission for very short jobs). Basic scheduler setup: As I’d said previously, we prioritize on fairshare almost exclusively. Most of our jobs (molecular dynamics, CFD) end up in a single batch partition, since GPU and big-memory jobs have other partitions. SelectType=select/cons_res SelectTypeParameters=CR_Core_Memory PriorityType=priority/multifactor PriorityDecayHalfLife=14-0 PriorityWeightFairshare=10 PriorityWeightAge=1000 PriorityWeightPartition=1 PriorityWeightJobSize=1000 PriorityMaxAge=1-0 TRES limits: We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set grptresrunmin=cpu=144 — there might be a way of doing this at a higher accounting level, but it works as is.
We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and set MaxJobsPerUser equal to our total GPU count. That helps prevent users from queue-stuffing the GPUs even if they stay well below the 1000 CPU-day TRES limit above. Backfill: SchedulerType=sched/backfill SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200 Can’t remember where I found the backfill guidance, but: - bf_window is set to our maximum job length (30 days) and bf_resolution is set to 1.5 days. Most of our users’ jobs are well over 1 day. - We have had users who didn’t use job arrays, and submitted a ton of small jobs at once, thus bf_max_job_user gives the scheduler a chance to start up to 80 jobs per user each cycle. This also prompted us to increase default_queue_depth, so the backfill scheduler would examine more jobs each cycle. - bf_continue should let the backfill scheduler continue where it left off if it gets interrupted, instead of having to start from scratch each time. I can guarantee you that our backfilling was sub-par until we tuned these parameters (or at least a few users could find a way to submit so many jobs that the backfill couldn’t keep up, even when we had
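When nearly everything is being started by backfill and the main scheduler almost never allocates, the backfill statistics in sdiag can at least show whether its cycles are completing and how deep into the queue they reach -- a quick check might look like this (section and field names are from memory and may differ slightly between versions):

sdiag | grep -A 12 "Backfilling stats"
# "Last cycle" / "Mean cycle" show how long a backfill pass takes,
# "Depth Mean" / "Queue length Mean" whether it examines the whole queue
sdiag -r    # reset the counters before sampling a fresh interval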
[slurm-users] Advice on using GrpTRESRunMin=cpu=
Hello, Before implementing "GrpTRESRunMin=cpu=limit" on our production cluster I'm doing some tests on the development cluster. I've only got a handful of compute nodes to play with and so I have set the limit sensibly low. That is, I've set the limit to be 576,000. That's equivalent to 400 CPU-days. In other words, I can potentially submit the following job... 1 x 2 nodes x 80 cpus/node x 2.5 days = 400 CPU-days. I submitted a set of jobs requesting 2 nodes, 80 cpus/node for 2.5 days. The first job is running and the rest are in the queue -- what I see makes sense...

JOBID PARTITION NAME  USER ST TIME  NODES NODELIST(REASON)
677   debug     myjob djb1 PD 0:00  2     (AssocGrpCPURunMinutesLimit)
678   debug     myjob djb1 PD 0:00  2     (AssocGrpCPURunMinutesLimit)
679   debug     myjob djb1 PD 0:00  2     (AssocGrpCPURunMinutesLimit)
676   debug     myjob djb1 R  12:52 2     navy[54-55]

On the other hand, I expected these jobs not to accrue priority, however they do appear to be (see the sprio output below). I'm working with Slurm v19.05.2. Have I missed something vital/important in the config? We hoped that the queued jobs would not accrue priority. We haven't, for example, used "accrue always". Have I got that wrong? Could someone please advise us. Best regards, David

[root@navy51 slurm]# sprio
JOBID PARTITION PRIORITY SITE AGE  FAIRSHARE JOBSIZE QOS
677   debug     5551643  10   1644 45        500     0
678   debug     5551643  10   1644 45        500     0
679   debug     5551642  10   1643 45        500     0
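One thing that may be worth checking alongside the replies here: the age factor of pending jobs can be controlled separately from the run-time limits, either via PriorityFlags or via the accrue limits that can be set on a QOS or association. A sketch, with the QOS name and value as illustrative assumptions:

scontrol show config | grep -i PriorityFlags      # ACCRUE_ALWAYS makes blocked jobs accrue age anyway
sacctmgr modify qos normal set MaxJobsAccruePerUser=4
# only four pending jobs per user then accumulate age priority;
# further queued jobs wait without accruing until one of those starts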
Re: [slurm-users] Job with srun is still RUNNING after node reboot
Hi, Yair, Out of curiosity have you checked to see if this is a runaway job? David On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom wrote: > Hi, > > We have an issue where running srun (with --pty zsh), and rebooting the > node (from a different shell), the srun reports: > srun: error: eio_message_socket_accept: slurm_receive_msg[an.ip.addr.ess]: > Zero Bytes were transmitted or received > and hangs. > > After the node boots, the slurm claims that job is still RUNNING, and srun > is still alive (but not responsive). > > I've tried it with various configurations (select/linear, > select/cons_tres, jobacct_gather/linux, jobacct_gather/cgroup, task/none, > task/cgroup), with the same results. We're using 19.05.1. > Running with sbatch causes the job to be in the more appropriate NODE_FAIL > state instead. > > Anyone else encountered this? or know how to make the job state not > RUNNING after it's clearly not running? > > Thanks in advance, > Yair. > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
Re: [slurm-users] Drain a single user's jobs
Hi Mark, I *think* you might need to update the user account to have access to that QoS (as part of their association). Using sacctmgr modify user + some additional args (they escape me at the moment). Also, you *might* have been able to set the MaxSubmitJobs at their account level to 0 and have them run without having to do the QoS approach - but that's just a guess on my end based on how we've done some things here. We had a "free period" for our clusters and once it was over we set the GrpSubmit jobs on an account to 0 which allowed in-flight jobs to continue but no new work to be submitted. HTH, David On Wed, Apr 1, 2020 at 5:57 AM Mark Dixon wrote: > Hi all, > > I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster. > > I'd like to stop user foo from submitting new jobs but allow their > existing jobs to run. > > We have several partitions, each with its own qos and MaxSubmitJobs > typically set to some vaue. These qos are stopping a "sacctmgr update user > foo set maxsubmitjobs=0" from doing anything useful, as per the > documentation. > > I've tried setting up a competing qos: > >sacctmgr add qos drain >sacctmgr modify qos drain set MaxSubmitJobs=0 >sacctmgr modify qos drain set flags=OverPartQOS >sacctmgr modify user foo set qos=drain > > This has successfully prevented the user from submitting new jobs, but > their existing jobs aren't running. I'm seeing the reason code > "InvalidQOS". > > Any ideas what I should be looking at, please? > > Thanks, > > Mark > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan
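The "additional args" hinted at above might, with the usual sacctmgr syntax, look roughly like this (a sketch only -- whether the running jobs stay eligible depends on their original QOS remaining in the user's list, and a user could still request another allowed QOS explicitly):

sacctmgr modify user foo set qos+=drain         # append to the association's QOS list
sacctmgr modify user foo set defaultqos=drain   # new submissions default to the blocked QOS
# to undo later:
sacctmgr modify user foo set qos-=drain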
[slurm-users] Slurm unlink error messages -- what do they mean?
Hello, We have, rather belatedly, just upgraded to Slurm v19.05.5. On the whole, so far so good -- no major problems. One user has complained that his job now crashes and reports an unlink error. That is.. slurmstepd: error: get_exit_code task 0 died by signal: 9 slurmstepd: error: unlink(/tmp/slurmd/job392987/slurm_script): No such file or directory I suspect that this message has something to do with the completion of one of the steps in his job. Apparently his job is quite complex with a number of inter-related tasks. Significantly, we decided to switch from an rpm to a 'build from source' installation. In other words, we did have rpms on each node in the cluster, but now have slurm installed on a global file system. Does anyone have any thoughts regarding the above issue, please? I'm still to see the user's script and so there might be a good logical explanation for the message on inspection. Best regards, David
Re: [slurm-users] Job Step Resource Requests are Ignored
i'm not sure I understand the problem. If you want to make sure the preamble and postamble run even if the main job doesn't run you can use '-d' from the man page -d, --dependency= Defer the start of this job until the specified dependencies have been satisfied completed. is of the form or . All dependencies must be satisfied if the "," separator is used. Any dependency may be satisfied if the "?" separator is used. Many jobs can share the same dependency and these jobs may even belong to different users. The value may be changed after job submission using the scontrol command. Once a job dependency fails due to the termination state of a preceding job, the dependent job will never be run, even if the preceding job is requeued and has a different termination state in a subsequent execution. for instance, create a job that contains this: preamble_id=`sbatch preamble.job` main_id=`sbatch -d afterok:$preamble_id main.job` sbatch -d afterany:$main_id postamble.job Best, D On Wed, May 6, 2020 at 2:19 PM Maria Semple wrote: > Hi Chris, > > I think my question isn't quite clear, but I'm also pretty confident the > answer is no at this point. The idea is that the script is sort of like a > template for running a job, and an end user can submit a custom job with > their own desired resource requests which will end up filling in the > template. I'm not in control of the Slurm cluster that will ultimately run > the job, nor the details of the job itself. For example, template-job.sh > might look like this: > > #!/bin/bash > srun -c 1 --mem=1k echo "Preamble" > srun -c --mem=m /bin/sh -c > srun -c 1 --mem=1k echo "Postamble" > > My goal is that even if the user requests 10 CPUs when the cluster only > has 4 available, the Preamble and Postamble steps will always run. But as I > said, it seems like that's not possible since the maximum number of CPUs > needs to be set on the sbatch allocation and the whole job would be > rejected on the basis that too many CPUs were requested. Is that correct? > > On Tue, May 5, 2020, 11:13 PM Chris Samuel wrote: > >> On Tuesday, 5 May 2020 11:00:27 PM PDT Maria Semple wrote: >> >> > Is there no way to achieve what I want then? I'd like the first and >> last job >> > steps to always be able to run, even if the second step needs too many >> > resources (based on the cluster). >> >> That should just work. >> >> #!/bin/bash >> #SBATCH -c 2 >> #SBATCH -n 1 >> >> srun -c 1 echo hello >> srun -c 4 echo big wide >> srun -c 1 echo world >> >> gives: >> >> hello >> srun: Job step's --cpus-per-task value exceeds that of job (4 > 2). Job >> step >> may never run. >> srun: error: Unable to create step for job 604659: More processors >> requested >> than permitted >> world >> >> > As a side note, do you know why it's not even possible to restrict the >> > number of resources a single step uses (i.e. set less CPUs than are >> > available to the full job)? >> >> My suspicion is that you've not set up Slurm to use cgroups to restrict >> the >> resources a job can use to just those requested. >> >> https://slurm.schedmd.com/cgroups.html >> >> All the best, >> Chris >> -- >> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA >> >> >> >> >>
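One wrinkle with the sketch above: plain sbatch prints "Submitted batch job <id>" rather than a bare job id, so the dependency arguments would come out malformed. The --parsable flag makes sbatch print just the id and keeps the same idea intact:

preamble_id=$(sbatch --parsable preamble.job)
main_id=$(sbatch --parsable -d afterok:${preamble_id} main.job)
sbatch -d afterany:${main_id} postamble.job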
Re: [slurm-users] How to view GPU indices of the completed jobs?
Hi Kota, This is from the job template that I give to my users: # Collect some information about the execution environment that may # be useful should we need to do some debugging. echo "CREATING DEBUG DIRECTORY" echo mkdir .debug_info module list > .debug_info/environ_modules 2>&1 ulimit -a > .debug_info/limits 2>&1 hostname > .debug_info/environ_hostname 2>&1 env |grep SLURM > .debug_info/environ_slurm 2>&1 env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1 env |grep OMPI > .debug_info/environ_openmpi 2>&1 env > .debug_info/environ 2>&1 if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then echo "SAVING CUDA ENVIRONMENT" echo env |grep CUDA > .debug_info/environ_cuda 2>&1 fi You could add something like this to one of the SLURM prologs to save the GPU list of jobs. Best, David On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki < kota.tsuyuzaki...@hco.ntt.co.jp> wrote: > Hello Guys, > > We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series) > and some of GPUs seemed to get troubles for attached > jobs. To investigate if the troubles happened on the same GPUs, I'd like > to get GPU indices of the completed jobs. > > In my understanding `scontrol show job` can show the indices (as IDX in > gres info) but cannot be used for completed job. And also > `sacct -j` is available for complete jobs but won't print the indices. > > Is there any way (commands, configurations, etc...) to see the allocated > GPU indices for completed jobs? > > Best regards, > > > 露崎 浩太 (Kota Tsuyuzaki) > kota.tsuyuzaki...@hco.ntt.co.jp > NTTソフトウェアイノベーションセンタ > 分散処理基盤技術プロジェクト > 0422-59-2837 > - > > > > > >
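If the goal is to be able to look the indices up after the job has finished, another option is to log them from a prolog or epilog rather than from the job script. A sketch, assuming the SLURM_JOB_GPUS variable is exported to Prolog/Epilog on your version (worth verifying) and with a placeholder log path:

#!/bin/bash
# Prolog/Epilog fragment: record which GPU indices each job was given on this node
if [ -n "${SLURM_JOB_GPUS}" ]; then
    echo "$(date +%FT%T) node=$(hostname -s) job=${SLURM_JOB_ID} user=${SLURM_JOB_USER} gpus=${SLURM_JOB_GPUS}" \
        >> /var/log/slurm/gpu_allocations.log
fi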
[slurm-users] Nodes do not return to service after scontrol reboot
Hello, We are running Slurm v19.05.5 and I am experimenting with the scontrol reboot command. I find that compute nodes reboot, but they are not returned to service. Rather they remain down following the reboot.. navy55 1debug*down 80 2:20:2 1920000 2000 (null) Reboot ASAP : reboot This is a diskfull node and so it doesn't take too long to reboot. For the sake of the argument I have set ResumeTimeOut to 1000 seconds which is well over what's needed... [root@navy51 slurm]# grep -i resume slurm.conf ResumeTimeout=1000 [root@navy51 slurm]# grep -i return slurm.conf ReturnToService=0 [root@navy51 slurm]# grep -i nhc slurm.conf # LBNL Node Health Check (NHC) #HealthCheckProgram=/usr/sbin/nhc For this experiment I have disabled the health checker, and I don't think setting ReturnToService=1 helps. Could anyone please help with this? We are about to update the node firmware and ensuring that the nodes are returned to service following their reboot would be useful. Best regards, David
Re: [slurm-users] Nodes do not return to service after scontrol reboot
Hello Chris, Thank you for your comments. The scontrol reboot command is now working as expected. Best regards, David From: slurm-users on behalf of Christopher Samuel Sent: 16 June 2020 18:16 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Nodes do not return to service after scontrol reboot On 6/16/20 8:16 am, David Baker wrote: > We are running Slurm v19.05.5 and I am experimenting with the *scontrol > reboot * command. I find that compute nodes reboot, but they are not > returned to service. Rather they remain down following the reboot.. How are you using "scontrol reboot" ? We do: scontrol reboot ASAP nextstate=resume reason=$REASON $NODE Which works for us (and we have health checks in our epilog that can trigger this for known issues like running low on unfragmented huge pages). All the best, Chris -- Chris Samuel : https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.csamuel.org%2F&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7C6fa4d9db3b0e47f6a03308d812197d60%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&sdata=V9%2Fytt3ActVODtPjD%2FXAB2w5TvVhSJDYJ9%2B0xUmJRUU%3D&reserved=0 : Berkeley, CA, USA
[slurm-users] Slurm and shared file systems
Hello, We are currently helping a research group to set up their own Slurm cluster. They have asked a very interesting question about Slurm and file systems. That is, they are posing the question -- do you need a shared user file store on a Slurm cluster? So, in the extreme case where there is no shared file store for users, can Slurm operate properly over a cluster? I have seen commands like sbcast to move a file from the submission node to a compute node, however that command can only transfer one file at a time. Furthermore, what would happen to the standard output files? I'm going to guess that there must be a shared file system, however it would be good if someone could please confirm this. Best regards, David
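For what it's worth, the sbcast pattern referred to above looks roughly like this inside a batch script (paths are illustrative); it stages input onto every allocated node, but the batch script's stdout/stderr file is still written on the node that runs the script, so without a shared filesystem it has to be collected afterwards:

#!/bin/bash
#SBATCH --nodes=2
sbcast input.tar /tmp/${SLURM_JOB_ID}_input.tar      # broadcast the file to all allocated nodes
srun tar -C /tmp -xf /tmp/${SLURM_JOB_ID}_input.tar  # unpack once per node
srun ./my_app /tmp/input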
[slurm-users] Slurm -- using GPU cards with NVLINK
Hello, We are installing a group of nodes which all contain 4 GPU cards. The GPUs are paired together using NVLINK as described in the matrix below. We are familiar with using Slurm to schedule and run jobs on GPU cards, but this is the first time we have dealt with NVLINK enabled GPUs. Could someone please advise us how to configure Slurm so that we can submit jobs to the cards and make use of the NVLINK? That is, what do we need to put in the gres.conf or slurm.conf, and how should users use the sbatch command? I presume, for example, that a user could make use of a GPU card, and potentially make use of memory on the paired card. Best regards, David

[root@alpha51 ~]# nvidia-smi topo --matrix
       GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity
GPU0    X    NV2   SYS   SYS   0,2,4,6,8,10  0
GPU1   NV2    X    SYS   SYS   0,2,4,6,8,10  0
GPU2   SYS   SYS    X    NV2   1,3,5,7,9,11  1
GPU3   SYS   SYS   NV2    X    1,3,5,7,9,11  1
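For reference, the gres.conf entries for that topology might look something like the sketch below -- the Links values just mirror the NV2 pairs in the matrix, and on recent Slurm versions AutoDetect=nvml can populate cores and links automatically; the node range and device paths are assumptions for illustration:

# gres.conf
NodeName=alpha[51-54] Name=gpu File=/dev/nvidia0 Cores=0,2,4,6,8,10 Links=-1,2,0,0
NodeName=alpha[51-54] Name=gpu File=/dev/nvidia1 Cores=0,2,4,6,8,10 Links=2,-1,0,0
NodeName=alpha[51-54] Name=gpu File=/dev/nvidia2 Cores=1,3,5,7,9,11 Links=0,0,-1,2
NodeName=alpha[51-54] Name=gpu File=/dev/nvidia3 Cores=1,3,5,7,9,11 Links=0,0,2,-1
# slurm.conf: GresTypes=gpu, plus Gres=gpu:4 on the node definitions

Users would then request cards with, e.g., sbatch --gres=gpu:2, and with the links information available the scheduler tries to hand out NVLINK-connected pairs together.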
Re: [slurm-users] Slurm -- using GPU cards with NVLINK
Hi Ryan, Thank you very much for your reply. That is useful. We'll see how we get on. Best regards, David From: slurm-users on behalf of Ryan Novosielski Sent: 11 September 2020 00:08 To: Slurm User Community List Subject: Re: [slurm-users] Slurm -- using GPU cards with NVLINK I’m fairly sure that you set this up the same way you set up for a peer-to-peer setup. Here’s ours: [root@cuda001 ~]# nvidia-smi topo --matrix GPU0GPU1GPU2GPU3mlx4_0 CPU Affinity GPU0 X PIX SYS SYS PHB 0-11 GPU1PIX X SYS SYS PHB 0-11 GPU2SYS SYS X PIX SYS 12-23 GPU3SYS SYS PIX X SYS 12-23 mlx4_0 PHB PHB SYS SYS X [root@cuda001 ~]# cat /etc/slurm/gres.conf … # 2 x K80 (perceval) NodeName=cuda[001-008] Name=gpu File=/dev/nvidia[0-1] CPUs=0-11 NodeName=cuda[001-008] Name=gpu File=/dev/nvidia[2-3] CPUs=12-23 This also seems to be related: https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fslurm.schedmd.com%2FSLUG19%2FGPU_Scheduling_and_Cons_Tres.pdf&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7C1a052163da5d4d0643d808d855ded053%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&sdata=lV2AExQxAc7svAT2FNJHJ8TsU5pfix0GwjpQ29Cc%2B0A%3D&reserved=0 -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `' > On Sep 10, 2020, at 11:00 AM, David Baker wrote: > > Hello, > > We are installing a group of nodes which all contain 4 GPU cards. The GPUs > are paired together using NVLINK as described in the matrix below. > > We are familiar with using Slurm to schedule and run jobs on GPU cards, but > this is the first time we have dealt with NVLINK enabled GPUs. Could someone > please advise us how to configure Slurm so that we can submit jobs to the > cards and make use of the NVLINK? That is, what do we need to put in the > gres.conf or slurm.conf, and how should users use the sbatch command? I > presume, for example, that a user could make use of a GPU card, and > potentially make use of memory on the paired card. > > Best regards, > David > > [root@alpha51 ~]# nvidia-smi topo --matrix > GPU0GPU1GPU2GPU3CPU AffinityNUMA Affinity > GPU0 X NV2 SYS SYS 0,2,4,6,8,100 > GPU1NV2 X SYS SYS 0,2,4,6,8,100 > GPU2SYS SYS X NV2 1,3,5,7,9,111 > GPU3SYS SYS NV2 X 1,3,5,7,9,111
[slurm-users] Accounts and QOS settings
Hello, I wondered if someone would be able to advise me on how to limit access to a group of resources, please. We have just installed a set of 6 GPU nodes. These nodes belong to a research department and both staff and students will potentially need access to the nodes. I need to ensure that only these two groups of users have access to the nodes. The general public should not have access to the resources. Access to the nodes is a 50/50 split between the two groups, and staff should be able to run much longer jobs than students. Those are the constraints. How can I do the above? I assume I put the users into two account groups -- staff and students, for example. Then I could use the groups to limit access to the partition. How do I best use a QOS to limit the number of nodes used/group and the walltime allowed? Should/can I apply a QOS to the account group, or the partition. My thought was to have two overlapping partitions each with the relevant QOS and account group access control. Perhaps I am making this too complicated. I would appreciate your advice, please. Best regards, David
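One possible shape for this, with every name and number below being a placeholder assumption rather than a recommendation: per-group accounts, a QOS per group carrying the node and walltime caps, and two overlapping partitions over the same six nodes.

sacctmgr add account dept_staff
sacctmgr add account dept_students
sacctmgr add user alice account=dept_staff
sacctmgr add qos staff
sacctmgr modify qos staff set MaxWall=7-00:00:00 GrpTRES=node=3
sacctmgr add qos students
sacctmgr modify qos students set MaxWall=2-00:00:00 GrpTRES=node=3

# slurm.conf
PartitionName=gpu_staff    Nodes=gpu[01-06] AllowAccounts=dept_staff    QOS=staff
PartitionName=gpu_students Nodes=gpu[01-06] AllowAccounts=dept_students QOS=students

AllowAccounts keeps everyone else out, the partition QOS's GrpTRES=node=3 enforces the 50/50 split, and MaxWall gives staff the longer walltime.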
[slurm-users] Controlling access to idle nodes
Hello, I would appreciate your advice on how to deal with this situation in Slurm, please. If I have a set of nodes used by 2 groups, and normally each group would each have access to half the nodes. So, I could limit each group to have access to 3 nodes each, for example. I am trying to devise a scheme that allows each group to make best use of the node always. In other words, each group could potentially use all the nodes (assuming they all free and the other group isn't using the nodes at all). I cannot set hard and soft limits in slurm, and so I'm not sure how to make the situation flexible. Ideally It would be good for each group to be able to use their allocation and then take advantage of any idle nodes via a scavenging mechanism. The other group could then pre-empt the scavenger jobs and claim their nodes. I'm struggling with this since this seems like a two-way scavenger situation. Could anyone please help? I have, by the way, set up partition-based pre-emption in the cluster. This allows the general public to scavenge nodes owned by research groups. Best regards, David
[slurm-users] unable to run on all the logical cores
Hi, my Slurm cluster has a dozen machines configured as follows:

NodeName=foobar01 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=257243 State=UNKNOWN

and scheduling is:

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

My problem is that only half of the logical cores are used when I run a computation. Let me explain: I use R and the package 'batchtools' to create jobs. All the jobs are created under the hood with sbatch. If I log in to all the machines in my cluster and run 'htop', I can see that only half of the logical cores are used. Other methods to measure the load of each machine confirmed this "visual" clue. My jobs ask Slurm for only one CPU per task. I tried to enforce that with the -c 1 option but it didn't make any difference. Then I realized there was something strange: when I do scontrol show job <jobid>, I can spot the following output:

NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:2 CoreSpec=*

that is, each job uses NumCPUs=2 instead of 1. Also, I'm not sure why TRES=cpu=2. Any idea on how to solve this problem and have 100% of the logical cores allocated? Best regards, David
Re: [slurm-users] unable to run on all the logical cores
Hi Rodrigo, good spot. At least, scontrol show job is now saying that each job only requires one "CPU", so it seems all the cores are treated the same way now. Though I still have the problem of not using more than half the cores. So I suppose it might be due to the way I submit (batchtools in this case) the jobs. I'm still investigating even if NumCPUs=1 now as it should be. Thanks. David On Thu, Oct 8, 2020 at 4:40 PM Rodrigo Santibáñez < rsantibanez.uch...@gmail.com> wrote: > Hi David, > > I had the same problem time ago when configuring my first server. > > Could you try SelectTypeParameters=CR_CPU instead of > SelectTypeParameters=CR_Core? > > Best regards, > Rodrigo. > > On Thu, Oct 8, 2020, 02:16 David Bellot > wrote: > >> Hi, >> >> my Slurm cluster has a dozen machines configured as follows: >> >> NodeName=foobar01 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 >> ThreadsPerCore=2 RealMemory=257243 State=UNKNOWN >> >> and scheduling is: >> >> # SCHEDULING >> SchedulerType=sched/backfill >> SelectType=select/cons_tres >> SelectTypeParameters=CR_Core >> >> My problem is that only half of the logical cores are used when I run a >> computation. >> >> Let me explain: I use R and the package 'batchtools' to create jobs. All >> the jobs are created under the hood with sbatch. If I log in to all the >> machines in my cluster and do a 'htop', I can see that only half of the >> logical cores are used. Other methods to measure the load of each machine >> confirmed this "visual" clue. >> My jobs ask Slurm for only one cpu per task. I tried to enforce that with >> the -c 1 but it didn't make any difference. >> >> Then I realized there was something strange: >> when I do scontrol show job , I can spot the following output: >> >>NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:* >>TRES=cpu=2,node=1,billing=2 >>Socks/Node=* NtasksPerN:B:S:C=0:0:*:2 CoreSpec=* >> >> that is each job uses NumCPUs=2 instead of 1. Also, I'm not sure why >> TRES=cpu=2 >> >> Any idea on how to solve this problem and have 100% of the logical cores >> allocated? >> >> Best regards, >> David >> > -- <https://www.lifetrading.com.au/> David Bellot Head of Quantitative Research A. Suite B, Level 3A, 43-45 East Esplanade, Manly, NSW 2095 E. david.bel...@lifetrading.com.au P. (+61) 0405 263012
Re: [slurm-users] Controlling access to idle nodes
Thank you very much for your comments. Oddly enough, I came up with the 3-partition model as well once I'd sent my email. So, your comments helped to confirm that I was thinking on the right lines. Best regards, David From: slurm-users on behalf of Thomas M. Payerle Sent: 06 October 2020 18:50 To: Slurm User Community List Subject: Re: [slurm-users] Controlling access to idle nodes We use a scavenger partition, and although we do not have the policy you describe, it could be used in your case. Assume you have 6 nodes (node-[0-5]) and two groups A and B. Create partitions partA = node-[0-2] partB = node-[3-5] all = node-[0-6] Create QoSes normal and scavenger. Allow normal QoS to preempt jobs with scavenger QoS In sacctmgr, give members of group A access to use partA with normal QoS and group B access to use partB with normal QoS Allow both A and B to use part all with scavenger QoS. So members of A can launch jobs on partA with normal QoS (probably want to make that their default), and similarly member of B can launch jobs on partB with normal QoS. But membes of A can also launch jobs on partB with scavenger QoS and vica versa. If the partB nodes used by A are needed by B, they will get preempted. This is not automatic (users need to explicitly say they want to run jobs on the other half of the cluster), but that is probably reasonable because there are some jobs one does not wish to get preempted even if they have to wait a while in the queue to ensure such. On Tue, Oct 6, 2020 at 11:12 AM David Baker mailto:d.j.ba...@soton.ac.uk>> wrote: Hello, I would appreciate your advice on how to deal with this situation in Slurm, please. If I have a set of nodes used by 2 groups, and normally each group would each have access to half the nodes. So, I could limit each group to have access to 3 nodes each, for example. I am trying to devise a scheme that allows each group to make best use of the node always. In other words, each group could potentially use all the nodes (assuming they all free and the other group isn't using the nodes at all). I cannot set hard and soft limits in slurm, and so I'm not sure how to make the situation flexible. Ideally It would be good for each group to be able to use their allocation and then take advantage of any idle nodes via a scavenging mechanism. The other group could then pre-empt the scavenger jobs and claim their nodes. I'm struggling with this since this seems like a two-way scavenger situation. Could anyone please help? I have, by the way, set up partition-based pre-emption in the cluster. This allows the general public to scavenge nodes owned by research groups. Best regards, David -- Tom Payerle DIT-ACIGS/Mid-Atlantic Crossroadspaye...@umd.edu<mailto:paye...@umd.edu> 5825 University Research Park (301) 405-6135 University of Maryland College Park, MD 20740-3831
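A compressed sketch of the three-partition layout described above, with the preemption wiring spelled out; all names, node ranges and modes here are illustrative assumptions:

# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE
PartitionName=partA Nodes=node-[0-2] AllowAccounts=groupA AllowQos=normal
PartitionName=partB Nodes=node-[3-5] AllowAccounts=groupB AllowQos=normal
PartitionName=all   Nodes=node-[0-5] AllowQos=scavenger

# sacctmgr: normal-QOS jobs may preempt scavenger-QOS jobs
sacctmgr add qos scavenger
sacctmgr modify qos normal set Preempt=scavenger

Each group runs at normal QOS on its own half and can opportunistically submit to the "all" partition with --qos=scavenger, where its jobs are requeued if the owning group needs the nodes back.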
Re: [slurm-users] unable to run on all the logical cores
Indeed, it makes sense now. However, if I launch many R processes using the "parallel" package, I can easily have all the "logical" cores running. In the background, if I'm correct ,R will "fork" and not create a thread. So we have independent processes. On a 20 cores CPU for example, I have 40 "logical" cores and all the cores are running, according to htop. With Slurm, I can't reproduce the same behavior even if I use the SelectTypeParameters=CR_CPU. So, is there a config to tune, an option to use in "sbatch" to achieve the same result, or should I rather launch 20 jobs per node and have each job split in two internally (using "parallel" or "future" for example)? On Thu, Oct 8, 2020 at 6:32 PM William Brown wrote: > R is single threaded. > > On Thu, 8 Oct 2020, 07:44 Diego Zuccato, wrote: > >> Il 08/10/20 08:19, David Bellot ha scritto: >> >> > good spot. At least, scontrol show job is now saying that each job only >> > requires one "CPU", so it seems all the cores are treated the same way >> now. >> > Though I still have the problem of not using more than half the cores. >> > So I suppose it might be due to the way I submit (batchtools in this >> > case) the jobs. >> Maybe R is generating single-threaded code? In that case, only a single >> process can run on a given core at a time (processes does not share >> memory map, threads do, and on Intel CPUs there's a single MMU per core, >> not one per thread as in some AMDs). >> >> -- >> Diego Zuccato >> DIFA - Dip. di Fisica e Astronomia >> Servizi Informatici >> Alma Mater Studiorum - Università di Bologna >> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy >> tel.: +39 051 20 95786 >> >> -- <https://www.lifetrading.com.au/> David Bellot Head of Quantitative Research A. Suite B, Level 3A, 43-45 East Esplanade, Manly, NSW 2095 E. david.bel...@lifetrading.com.au P. (+61) 0405 263012
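For anyone hitting the same thing, the two knobs that usually matter when you want both hardware threads of every core doing independent single-CPU work -- a sketch, not a drop-in configuration:

# slurm.conf: schedule individual threads ("CPUs") rather than whole cores
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory

# per job: explicitly allow two tasks per physical core
sbatch --ntasks=40 --ntasks-per-core=2 wrapper.sh

With CR_Core a one-CPU request is still charged a whole core on a ThreadsPerCore=2 node (hence the NumCPUs=2 seen earlier in the thread), so at most half of the logical CPUs ever show up busy in htop.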
[slurm-users] ninja and cmake
Hi, I installed a cluster with 10 nodes and I'd like to try compiling a very large code base using all the nodes. The context is as follows: - my code base is in C++, I use gcc. - configuration is done with CMake - compilation is processed by ninja (something similar to make) I can srun ninja and get the code base compiled on another node using as many cores as I want on the other node. Now what I want to do is to have each file being compiled as a single Slurm job, so that I can spread my compilation over all the nodes of the cluster and not just on one machine. I know that ccache and distcc exist and I use them, but here I want to test if it's possible to do it with Slurm (as a proof of concept). Cheers, David
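As a proof of concept this can be done without distcc by letting CMake prefix every compiler invocation with srun, so each compile becomes its own small allocation; the flags and the ninja job count below are assumptions, and per-file scheduling latency will likely dominate the build time:

cmake -G Ninja \
      -DCMAKE_C_COMPILER_LAUNCHER="srun;-N1;-n1;-c1;--mem=2G" \
      -DCMAKE_CXX_COMPILER_LAUNCHER="srun;-N1;-n1;-c1;--mem=2G" \
      ..
ninja -j 200    # keep many srun-wrapped compiles in flight across the nodes

CMake passes the semicolon-separated launcher list in front of every gcc call, so ninja's local parallelism turns into cluster-wide parallelism.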
[slurm-users] Backfill pushing jobs back
Hello, We see the following issue with smaller jobs pushing back large jobs. We are using slurm 19.05.8 so not sure if this is patched in newer releases. With a 4 node test partition I submit 3 jobs as 2 users:

ssh hpcdev1@navy51 'sbatch --nodes=3 --ntasks-per-node=40 --partition=backfilltest --time=120 --wrap="sleep 7200"'
ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'
ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'

Then I increase the priority of the pending jobs significantly. Reading the manual, my understanding is that nodes should be held for these jobs.

for job in $(squeue -h -p backfilltest -t pd -o %i); do scontrol update job ${job} priority=10;done

squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"
JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE
28482 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING
28483 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING
28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING

So, there is one node free in our 4 node partition. Naturally, a small job with a walltime of less than 1 hour could run in that, but we are also seeing backfill start longer jobs.

backfilltest up 2-12:00:00 3 alloc reddev[001-003]
backfilltest up 2-12:00:00 1 idle reddev004

ssh hpcdev3@navy51 'sbatch --nodes=1 --ntasks-per-node=40 --partition=backfilltest --time=720 --wrap="sleep 432000"'

squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"
JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE
28482 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING
28483 | hpcdev2 | 160 | 10 | 1:00:00 | N/A | PENDING
28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING
28484 | hpcdev3 | 40 | 37541 | 12:00:00 | 2020-12-08T09:54:48 | RUNNING

Is this expected behaviour? It is also weird that the pending jobs don't have a start time. I have increased the backfill parameters significantly, but it doesn't seem to affect this at all.

SchedulerParameters=bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60

Best regards, David
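In case it helps while digging into this: the missing start times on the high-priority pending jobs are themselves a clue, since it is the backfill scheduler that normally computes and publishes expected start times. A couple of low-risk checks (a sketch):

squeue --start -p backfilltest -o "%i %Q %S %r"   # expected start time and reason per pending job
scontrol setdebugflags +backfill                  # log backfill decisions to slurmctld.log for a while
scontrol setdebugflags -backfill                  # then switch the extra logging off again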
Re: [slurm-users] Backfill pushing jobs back
Hi Chris, Thank you for your reply. It isn't long since we upgraded to Slurm v19, however it sounds like we should start to actively look at v20 since this issue is causing significant problems on our cluster. We're download and install v20 on our dev cluster, and experiment. Best regards, David From: slurm-users on behalf of Chris Samuel Sent: 09 December 2020 16:37 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Backfill pushing jobs back CAUTION: This e-mail originated outside the University of Southampton. Hi David, On 9/12/20 3:35 am, David Baker wrote: > We see the following issue with smaller jobs pushing back large jobs. We > are using slurm 19.05.8 so not sure if this is patched in newer releases. This sounds like a problem that we had at NERSC (small jobs pushing back multi-thousand node jobs), and we carried a local patch for which Doug managed to get upstreamed in 20.02.x (I think it landed in 20.02.3, but 20.02.6 is the current version). Hope this helps! Chris -- Chris Samuel : https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.csamuel.org%2F&data=04%7C01%7Cd.j.baker%40soton.ac.uk%7Ccc84ff45cb604a29dd6208d89c614721%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C63743128890119%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=OuSpfkTGBscxqTfJ0CbvX44GanHn4J76p9tV1M1AqSw%3D&reserved=0 : Berkeley, CA, USA
Re: [slurm-users] Backfill pushing jobs back
Hello, Could I please follow up on the Slurm patch that relates to smaller jobs pushing large jobs back? My colleague downloaded and installed the most recent production version of Slurm today and tells me that it did not appear to resolve the issue. Just to note, we are currently running v19.05.8 and finding that the backfill mechanism pushes large jobs back. In theory, should the latest Slurm help us in sorting that issue out? I understand that we're testing v20.11.2, however I should clarify that with my colleague tomorrow. Does anyone have any comments, please? Is there any parameter that we need to set to activate the backfill patch, for example? Best regards, David From: slurm-users on behalf of Chris Samuel Sent: 09 December 2020 16:37 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Backfill pushing jobs back CAUTION: This e-mail originated outside the University of Southampton. Hi David, On 9/12/20 3:35 am, David Baker wrote: > We see the following issue with smaller jobs pushing back large jobs. We > are using slurm 19.05.8 so not sure if this is patched in newer releases. This sounds like a problem that we had at NERSC (small jobs pushing back multi-thousand node jobs), and we carried a local patch for which Doug managed to get upstreamed in 20.02.x (I think it landed in 20.02.3, but 20.02.6 is the current version). Hope this helps! Chris -- Chris Samuel : https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.csamuel.org%2F&data=04%7C01%7Cd.j.baker%40soton.ac.uk%7Ccc84ff45cb604a29dd6208d89c614721%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C63743128890119%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=OuSpfkTGBscxqTfJ0CbvX44GanHn4J76p9tV1M1AqSw%3D&reserved=0 : Berkeley, CA, USA
[slurm-users] Backfill pushing jobs back
Hello, Last year I posted on this forum looking for some help on backfill in Slurm. We are currently using Slurm 19.05.8 and we find that backfilled (smaller) jobs tend to push back large jobs in our cluster. Chris Samuel replied to our post with the following response... This sounds like a problem that we had at NERSC (small jobs pushing back multi-thousand node jobs), and we carried a local patch for which Doug managed to get upstreamed in 20.02.x (I think it landed in 20.02.3, but 20.02.6 is the current version). We looked through the release notes and sure enough there is a reference to a job starvation patch, however I'm not sure that it is the relevant patch... (in 20.02.2) > -- Fix scheduling issue when there are not enough nodes available to run a > job > resulting in possible job starvation. We decided to download and install the latest production version, 20.11.2, of Slurm. One of my team members managed the installation and ran his backfill tests only to find that the above backfill issue was still present. Should we wind back to version 20.02.6 and install/test that instead? Could someone please advise us? It would seem odd that a recent version of Slurm would still have a backfill issue that starves larger jobs out. We're wondering if we have forgotten to configure something very fundamental, for example. Best regards, David
[slurm-users] Validating SLURM sreport cluster utilization report
Hi, We've been using the sreport cluster utilization report to report on Down time and therefore produce an uptime figure for the entire cluster. Which we hope will be above 99% or very close to, for every month of the year. Most of the time the figure that comes back is one that fits the perception of the day to day running of the cluster. We don't log node UP/DOWN in any way (beyond what slurm does) and rely on sreport as explained above. The December figure we have is lower than 99% and there are 438 slurm nodes in the cluster. In December we only remember having problems with 3 nodes. So at the moment off the top of the head we don't understand this reported Down time. Is anyone else relying on sreport for this metric? If so have you encountered this sort of situation? regards David - David Simpson - Senior Systems Engineer ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB David Simpson - peiriannydd uwch systemau ARCCA, Adeilad Redwood, King Edward VII Avenue, Caerdydd, CF10 3NB simpso...@cardiff.ac.uk<mailto:simpso...@cardiff.ac.uk> +44 29208 74657 COVID-19 Cardiff University is currently under remote work restrictions. Our staff are continuing normal work schedules, but responses may be slower than usual. We appreciate your patience during this unprecedented time COVID-19 Ar hyn o bryd mae Prifysgol Caerdydd o dan gyfyngiadau gweithio o bell. Mae ein staff yn parhau ag amserlenni gwaith arferol, ond gall ymatebion fod yn arafach na'r arfer. Rydym yn gwerthfawrogi eich amynedd yn ystod yr amser digynsail hwn.
Re: [slurm-users] Validating SLURM sreport cluster utilization report
Out of interest (for those that do record and/or report on uptime) if you aren't using the sreport cluster utilization report what alternative method are you using instead? If you are using sreport cluster utilization report have you encountered this? thanks David - David Simpson - Senior Systems Engineer ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB David Simpson - peiriannydd uwch systemau ARCCA, Adeilad Redwood, King Edward VII Avenue, Caerdydd, CF10 3NB simpso...@cardiff.ac.uk<mailto:simpso...@cardiff.ac.uk> +44 29208 74657 COVID-19 Cardiff University is currently under remote work restrictions. Our staff are continuing normal work schedules, but responses may be slower than usual. We appreciate your patience during this unprecedented time COVID-19 Ar hyn o bryd mae Prifysgol Caerdydd o dan gyfyngiadau gweithio o bell. Mae ein staff yn parhau ag amserlenni gwaith arferol, ond gall ymatebion fod yn arafach na'r arfer. Rydym yn gwerthfawrogi eich amynedd yn ystod yr amser digynsail hwn. From: slurm-users On Behalf Of David Simpson Sent: 22 January 2021 16:34 To: slurm-users@lists.schedmd.com Subject: [slurm-users] Validating SLURM sreport cluster utilization report Hi, We've been using the sreport cluster utilization report to report on Down time and therefore produce an uptime figure for the entire cluster. Which we hope will be above 99% or very close to, for every month of the year. Most of the time the figure that comes back is one that fits the perception of the day to day running of the cluster. We don't log node UP/DOWN in any way (beyond what slurm does) and rely on sreport as explained above. The December figure we have is lower than 99% and there are 438 slurm nodes in the cluster. In December we only remember having problems with 3 nodes. So at the moment off the top of the head we don't understand this reported Down time. Is anyone else relying on sreport for this metric? If so have you encountered this sort of situation? regards David - David Simpson - Senior Systems Engineer ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB David Simpson - peiriannydd uwch systemau ARCCA, Adeilad Redwood, King Edward VII Avenue, Caerdydd, CF10 3NB simpso...@cardiff.ac.uk<mailto:simpso...@cardiff.ac.uk> +44 29208 74657 COVID-19 Cardiff University is currently under remote work restrictions. Our staff are continuing normal work schedules, but responses may be slower than usual. We appreciate your patience during this unprecedented time COVID-19 Ar hyn o bryd mae Prifysgol Caerdydd o dan gyfyngiadau gweithio o bell. Mae ein staff yn parhau ag amserlenni gwaith arferol, ond gall ymatebion fod yn arafach na'r arfer. Rydym yn gwerthfawrogi eich amynedd yn ystod yr amser digynsail hwn.
[slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?
Hi all: I have a new cluster, and I am attempting to dump all the accounting data that I generated in the test period before our official opening. Installation info: * Bright Cluster Manager 9.0 * Slurm 20.02.6 * Red Hat 8.1 In slurmdbd.conf, I have: ArchiveJobs=yes ArchiveSteps=yes ArchiveEvents=yes ArchiveSuspend=yes On the commandline, I do: $ sudo sacctmgr archive dump Directory=/data/Backups/Slurm PurgeEventAfter=1hours PurgeJobAfter=1hours PurgeStepAfter=1hours PurgeSuspendAfter=1hours This may result in loss of accounting database records (if Purge* options enabled). Are you sure you want to continue? (You have 30 seconds to decide) (N/y): y sacctmgr: slurmdbd: SUCCESS However, no dump file is produced. And if I run sreport, I still see data from last month. (I also tried "1hour", i.e. dropping the "s".) Is there something I am missing? Thanks, Dave Chin -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode Drexel Internal Data
[slurm-users] Unsetting a QOS Flag?
Hello all: I have a QOS defined which has the Flaq DenyOnLimit set: $ sacctmgr show qos foo format=name,flags NameFlags -- foo DenyOnLimit How can I "unset" that Flag? I tried "sacctmgr modify qos foo unset Flags=DenyOnLimit", and "sacctmgr modify qos foo set Flags=NoDenyOnLimit", to no avail. Thanks in advance, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode Drexel Internal Data
Re: [slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?
Well, I seem to have figured it out. This worked and did what I wanted to (I think): $ sudo sacctmgr archive dump Directory=/data/Backups/Slurm PurgeEventAfter=1hour \ PurgeJobAfter=1hour PurgeStepAfter=1hour PurgeSuspendAfter=1hour \ PurgeUsageAfter=1hour Events Jobs Steps Suspend Usage This generated various usage dump files, and the job_table and step_table dumps. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode From: slurm-users on behalf of Chin,David Sent: Friday, February 5, 2021 15:47 To: Slurm-Users List Subject: [slurm-users] sacctmgr archive dump - no dump file produced, and data not purged? External. Hi all: I have a new cluster, and I am attempting to dump all the accounting data that I generated in the test period before our official opening. Installation info: * Bright Cluster Manager 9.0 * Slurm 20.02.6 * Red Hat 8.1 In slurmdbd.conf, I have: ArchiveJobs=yes ArchiveSteps=yes ArchiveEvents=yes ArchiveSuspend=yes On the commandline, I do: $ sudo sacctmgr archive dump Directory=/data/Backups/Slurm PurgeEventAfter=1hours PurgeJobAfter=1hours PurgeStepAfter=1hours PurgeSuspendAfter=1hours This may result in loss of accounting database records (if Purge* options enabled). Are you sure you want to continue? (You have 30 seconds to decide) (N/y): y sacctmgr: slurmdbd: SUCCESS However, no dump file is produced. And if I run sreport, I still see data from last month. (I also tried "1hour", i.e. dropping the "s".) Is there something I am missing? Thanks, Dave Chin -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode Drexel Internal Data Drexel Internal Data Drexel Internal Data
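If the aim is to keep this pruning happening automatically rather than via one-off sacctmgr runs, the same retention policy can live in slurmdbd.conf; the periods below are placeholders:

ArchiveDir=/data/Backups/Slurm
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
ArchiveSuspend=yes
PurgeEventAfter=12month
PurgeJobAfter=12month
PurgeStepAfter=6month
PurgeSuspendAfter=6month
PurgeUsageAfter=24month

slurmdbd then archives and purges on a rolling schedule instead of waiting for a manual archive dump.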
[slurm-users] sreport cluster AccountUtilizationByUser showing utilization of a deleted account
Hello, all:

Details:
* slurm 20.02.6
* MariaDB 10.3.17
* RHEL 8.1

I have a fairshare setup. I went through a couple of iterations in testing of manually creating accounts and users that I later deleted before putting in what is to be the production setup. One of the deleted accounts is named "urcfadm" - in the slurm_acct_db → acct_table, the row (?) for that account has a value 1 in the "deleted" column:

  creation_time  mod_time    deleted  name     description     organization
  1607378518     1611091499  1        urcfadm  urcf_sysadmins  research

I also purged all Events, Jobs, Steps, Suspend, Usage that are older than 1 hour:

  sacctmgr archive dump Directory=/data/Backups/Slurm PurgeEventAfter=1hour \
      PurgeJobAfter=1hour PurgeStepAfter=1hour PurgeSuspendAfter=1hour \
      PurgeUsageAfter=1hour Events Jobs Steps Suspend Usage

When I run

  sreport cluster AccountUtilizationByUser Start=2021-02-09 End=2021-02-10 -T billing

I get numbers which don't add up as one goes up to the root node of the tree. And I have a line for the account "urcfadm":

  Cluster  Account  Login  Proper Name  TRES Name    Used
  -------  -------  -----  -----------  ---------  ------
  ...
  picotte  urcfadm                      billing    110708
  ...

In the report period, no jobs ran under the urcfadm account.

Is there a way to fix this without just purging all the data? If there is no "graceful" fix, is there a way I can "reset" the slurm_acct_db, i.e. actually purge all data in all tables?

Thanks in advance,
Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel   dw...@drexel.edu   215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode
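A hedged way to see which accounts are still sitting in the database with the deleted flag set (my addition, not from the thread); "slurm_acct_db" and "acct_table" are the names mentioned above, and "withdeleted" is the sacctmgr option for listing removed entities:

  # List accounts including deleted ones (Slurm keeps them, flagged deleted=1,
  # so that historical usage can still be attributed to them)
  sacctmgr show account withdeleted format=account,description,organization

  # Or inspect the table directly in MariaDB
  mysql slurm_acct_db -e "SELECT name, deleted FROM acct_table WHERE deleted = 1;"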
Re: [slurm-users] prolog not passing env var to job
ahmet.mer...@uhem.itu.edu.tr wrote:
> Prolog and TaskProlog are different parameters and scripts. You should
> use the TaskProlog script to set env. variables.

Can you tell me how to do this for srun? E.g. users request an interactive shell:

  srun -n 1 -t 600 --pty /bin/bash

but the shell on the compute node does not have the env variables set.

I use the same prolog script as TaskProlog, which sets it properly for jobs submitted with sbatch.

Thanks in advance,
Dave Chin

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel   dw...@drexel.edu   215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode

From: slurm-users on behalf of mercan
Sent: Friday, February 12, 2021 16:27
To: Slurm User Community List; Herc Silverstein; slurm-us...@schedmd.com
Subject: Re: [slurm-users] prolog not passing env var to job

Hi;

Prolog and TaskProlog are different parameters and scripts. You should use the TaskProlog script to set env. variables.

Regards;

Ahmet M.
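For reference (an aside, not part of the original message): a TaskProlog script communicates with the task through its standard output. A line of the form "export NAME=value" is injected into the task's environment, and a line of the form "print ..." is written to the task's stdout. A minimal sketch, using a made-up variable name:

  #!/bin/bash
  # TaskProlog sketch: slurmstepd runs this on the compute node, once per task,
  # as the job user. Lines echoed as "export VAR=value" end up in the task env.
  echo "export MY_SCRATCH=/local/scratch/${SLURM_JOB_ID}"
  echo "print TaskProlog set MY_SCRATCH for job ${SLURM_JOB_ID}"

srun also accepts a per-invocation --task-prolog=<executable> option that uses the same stdout convention, which is convenient for testing without editing slurm.conf.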
Re: [slurm-users] prolog not passing env var to job
Hi, Brian:

So, this is my SrunProlog script -- I want a job-specific tmp dir, which makes for easy cleanup at end of job:

  #!/bin/bash
  if [[ -z ${SLURM_ARRAY_JOB_ID+x} ]]
  then
      export TMP="/local/scratch/${SLURM_JOB_ID}"
      export TMPDIR="${TMP}"
      export LOCAL_TMPDIR="${TMP}"
      export BEEGFS_TMPDIR="/beegfs/scratch/${SLURM_JOB_ID}"
  else
      export TMP="/local/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"
      export TMPDIR="${TMP}"
      export LOCAL_TMPDIR="${TMP}"
      export BEEGFS_TMPDIR="/beegfs/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"
  fi

  echo DEBUG srun_set_tmp.sh
  echo I am `whoami`

  /usr/bin/mkdir -p ${TMP}
  chmod 700 ${TMP}
  /usr/bin/mkdir -p ${BEEGFS_TMPDIR}
  chmod 700 ${BEEGFS_TMPDIR}

And this is my srun session:

  picotte001::~$ whoami
  dwc62
  picotte001::~$ srun -p def --mem 1000 -n 4 -t 600 --pty /bin/bash
  DEBUG srun_set_tmp.sh
  I am dwc62
  node001::~$ echo $TMP
  /local/scratch/80472
  node001::~$ ll !$
  ll $TMP
  /bin/ls: cannot access '/local/scratch/80472': No such file or directory
  node001::~$ mkdir $TMP
  node001::~$ ll -d !$
  ll -d $TMP
  drwxrwxr-x 2 dwc62 dwc62 6 Mar  4 11:52 /local/scratch/80472/
  node001::~$ exit

So, the "echo" and "whoami" statements are executed by the prolog script, as expected, but the mkdir commands are not?

Thanks,
Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel   dw...@drexel.edu   215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode

From: slurm-users on behalf of Brian Andrus
Sent: Thursday, March 4, 2021 10:12
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] prolog not passing env var to job

It seems to me, if you are using srun directly to get an interactive shell, you can just run the script once you get your shell.

You can set the variables and then run srun. It automatically exports the environment. If you want to change a particular one (or more), use something like --export=ALL,MYVAR=othervalue

do 'man srun' and look at the --export option

Brian Andrus
Re: [slurm-users] prolog not passing env var to job
Hi Brian:

This works just as I expect for sbatch.

The example srun execution I showed was a non-array job, so the first branch of the "if" statement applies; it is the second branch, which deals with job arrays, that has the period.

The value of TMP is correct, i.e. "/local/scratch/80472", and the command in the prolog script is correct, i.e. "/usr/bin/mkdir -p ${TMP}". If I type that command during the interactive job, it does what I expect, i.e. it creates the directory $TMP = /local/scratch/80472.

Regards,
Dave

From: slurm-users on behalf of Brian Andrus
Sent: Thursday, March 4, 2021 13:48
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] prolog not passing env var to job

I think it isn't running how you think or there is something not provided in the description.

You have:

  export TMP="/local/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"

Notice that period in there. Then you have:

  node001::~$ echo $TMP
  /local/scratch/80472

There is no period. In fact, SLURM_ARRAY_JOB_ID should be blank too if you are not running as an array session.

However, to your desire for a job-specific tmp directory: check out the mktemp command. It should do just what you want. I use it for interactive desktop sessions for users to create the temp directory that is used for X sessions.

You just need to make sure the user has write access to the directory you are creating the directory in (chmod 1777 for the parent directory is good).

Brian Andrus
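A small sketch of the mktemp suggestion (my illustration, not code from the thread), assuming a parent directory such as /local/scratch that already exists on the node with mode 1777:

  #!/bin/bash
  # Create a unique, job-scoped scratch directory as the job user.
  # mktemp replaces the XXXXXX suffix with random characters.
  export TMP=$(mktemp -d "/local/scratch/${SLURM_JOB_ID:-manual}.XXXXXX")
  export TMPDIR="${TMP}"
  echo "Scratch directory for this job: ${TMP}"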
Re: [slurm-users] prolog not passing env var to job
My mistake - from slurm.conf(5): SrunProlog runs on the node where the "srun" is executing, i.e. the login node. That explains why the directory is not being created on the compute node, while the echo statements still appear.

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel   dw...@drexel.edu   215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode
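One way to act on this finding (a sketch of mine, not something proposed in the thread): move the directory creation into a TaskProlog, which does run on the compute node as the job user, and hand the paths to the task via the stdout "export" convention mentioned earlier. The paths mirror the SrunProlog script quoted above.

  #!/bin/bash
  # TaskProlog sketch: create the job-specific scratch dir on the compute node
  # and export its location into the task environment.
  if [[ -z ${SLURM_ARRAY_JOB_ID+x} ]]; then
      tmp="/local/scratch/${SLURM_JOB_ID}"
  else
      tmp="/local/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"
  fi
  mkdir -p "${tmp}" && chmod 700 "${tmp}"
  # Lines of the form "export NAME=value" on stdout are injected into the task env.
  echo "export TMP=${tmp}"
  echo "export TMPDIR=${tmp}"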
[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
Hi, all:

I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: the job was a Matlab script, and its output was incomplete.

Here's sacct output:

  JobID         JobName     User  Partition  NodeList  Elapsed   State       ExitCode  ReqMem  MaxRSS    MaxVMSize  AllocTRES                 AllocGRE
  ------------  ----------  ----  ---------  --------  --------  ----------  --------  ------  --------  ---------  ------------------------  --------
  83387         ProdEmisI+  foob  def        node001   03:34:26  OUT_OF_ME+  0:125     128Gn                        billing=16,cpu=16,node=1
  83387.batch   batch                        node001   03:34:26  OUT_OF_ME+  0:125     128Gn   1617705K  7880672K   cpu=16,mem=0,node=1
  83387.extern  extern                       node001   03:34:26  COMPLETED   0:0       128Gn   460K      153196K    billing=16,cpu=16,node=1

Thanks in advance,
Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel   dw...@drexel.edu   215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode
Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
Here's seff output, if it makes any difference. In any case, the exact same job was run by the user on their laptop with 16 GB RAM with no problem.

  Job ID: 83387
  Cluster: picotte
  User/Group: foob/foob
  State: OUT_OF_MEMORY (exit code 0)
  Nodes: 1
  Cores per node: 16
  CPU Utilized: 06:50:30
  CPU Efficiency: 11.96% of 2-09:10:56 core-walltime
  Job Wall-clock time: 03:34:26
  Memory Utilized: 1.54 GB
  Memory Efficiency: 1.21% of 128.00 GB

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel   dw...@drexel.edu   215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode

From: slurm-users on behalf of Paul Edmon
Sent: Monday, March 15, 2021 14:02
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

One should keep in mind that sacct results for memory usage are not accurate for Out Of Memory (OoM) jobs. This is due to the fact that the job is typically terminated prior to the next sacct polling period, and also terminated prior to it reaching full memory allocation. Thus I wouldn't trust any of the results with regards to memory usage if the job is terminated by OoM. sacct just can't pick up a sudden memory spike like that, and even if it did it would not correctly record the peak memory because the job was terminated prior to that point.

-Paul Edmon-
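A hedged cross-check (my addition, not from the thread): when memory is enforced through cgroups, the kernel on the compute node logs the OOM kill, so the kernel ring buffer or journal usually names the process and cgroup that hit the limit. Run on node001, the node from the sacct output above:

  # Kernel messages about the OOM killer, with human-readable timestamps
  dmesg -T | grep -i 'oom\|out of memory'

  # Or, on systemd hosts, the kernel journal around the time the job ended
  journalctl -k --since "2021-03-15" | grep -i oom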
Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
Hi Michael:

I looked at the Matlab script: it's loading an xlsx file which is 2.9 kB. There are some "static" arrays allocated with ones() or zeros(), but those use small subsets (< 10 columns) of the loaded data, and outputs are arrays of 6x10. Certainly there are not 16e9 rows in the original file. The saved output .mat file is only 1.8 kB.

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel   dw...@drexel.edu   215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode

From: slurm-users on behalf of Renfro, Michael
Sent: Monday, March 15, 2021 14:04
To: Slurm User Community List
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

Just a starting guess, but are you certain the MATLAB script didn't try to allocate enormous amounts of memory for variables? That'd be about 16e9 floating point values, if I did the units correctly.
Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
One possible datapoint: on the node where the job ran, there were two slurmstepd processes running, both at 100% CPU, even after the job had ended.

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel   dw...@drexel.edu   215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode
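For what it's worth (my addition, not from the thread), one way to see whether those slurmstepd processes still belong to a step that Slurm is tracking is to compare slurmd's view with the process table on the node; 83387 is the job ID from this thread:

  # On the compute node: PIDs that Slurm still associates with the job
  scontrol listpids 83387

  # Compare against what is actually running there
  ps -eo pid,ppid,etime,pcpu,comm | grep slurmstepd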