Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread David
to avoid having to replicate scheduler logic in > job_submit.lua... :) > > -- > Diego Zuccato > DIFA - Dip. di Fisica e Astronomia > Servizi Informatici > Alma Mater Studiorum - Università di Bologna > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy > tel.: +39 051 20 95786 > > -- David Rhey --- Advanced Research Computing University of Michigan

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread David
tion it prevents jobs > from being queued! > Nothing in the documentation about --partition made me think that > forbidding access to one partition would make a job unqueueable... > > Diego > > Il 21/09/2023 14:41, David ha scritto: > > I would think that slurm would only

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread David
DC (USA) wrote: > On Sep 21, 2023, at 9:46 AM, David wrote: > > Slurm is working as it should. From your own examples you proved that; by > not submitting to b4 the job works. However, looking at man sbatch: > >-p, --partition= > Request a specific p

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-28 Thread David
lurmd nodes. > > Is there an expedited, simple, slimmed down upgrade path to follow if > we're looking at just a . level upgrade? > > Rob > > -- David Rhey --- Advanced Research Computing University of Michigan

Re: [slurm-users] TRES sreport per association

2023-11-16 Thread David
be very lengthy output. HTH, David On Sun, Nov 12, 2023 at 6:03 PM Kamil Wilczek wrote: > Dear All, > > is it possible to report GPU Minutes per association? Suppose > I have two associations like this: > >sacctmgr show assoc where user=$(whoami) > format=account%10,use

[slurm-users] "command not found"

2017-12-15 Thread david
not found. What would be way to deal with this situation ? what is common practice ? thanks, david

[slurm-users] External provisioning for accounts and other things (?)

2018-09-18 Thread David Rhey
d be extra interested in how you achieved that. Thanks! -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan

Re: [slurm-users] External provisioning for accounts and other things (?)

2018-09-18 Thread David Rhey
couple of the underlying libraries (Perl wrappers around sacctmgr and > sshare commands) are available on CPAN (Slurm::Sacctmgr, Slurm::Sshare); > the rest lack the polish and finish required for publishing on CPAN. > > On Tue, Sep 18, 2018 at 3:02 PM David Rhey wrote: > >>

Re: [slurm-users] External provisioning for accounts and other things (?)

2018-09-19 Thread David Rhey
Thanks! I'll check this out. Ya'll are awesome for the responses. On Wed, Sep 19, 2018 at 7:57 AM Chris Samuel wrote: > On Wednesday, 19 September 2018 5:00:58 AM AEST David Rhey wrote: > > > First time caller, long-time listener. Does anyone use any sort of > exter

[slurm-users] Priority access for a group of users

2019-02-15 Thread David Baker
anning to place the nodes in their own partition. The node owners will have priority access to the nodes in that partition, but will have no advantage when submitting jobs to the public resources. Does anyone please have any ideas how to deal with this? Best regards, David

Re: [slurm-users] How to request ONLY one CPU instead of one socket or one node?

2019-02-15 Thread David Rhey
tition=standard --mem=1G --pty bash [drhey@bn19 ~]$ echo $SLURM_CPUS_ON_NODE 4 HTH! David On Wed, Feb 13, 2019 at 9:24 PM Wang, Liaoyuan wrote: > Dear there, > > > > I wrote an analytic program to analyze my data. The analysis costs around > twenty days to analyze all data for

Re: [slurm-users] Priority access for a group of users

2019-02-15 Thread david baker
st regards, David On Fri, Feb 15, 2019 at 3:09 PM Paul Edmon wrote: > Yup, PriorityTier is what we use to do exactly that here. That said > unless you turn on preemption jobs may still pend if there is no space. We > run with REQUEUE on which has worked well. > > > -Paul Edm

[slurm-users] Question on billing tres information from sacct, sshare, and scontrol

2019-02-21 Thread David Rhey
le of theories, and have been looking through source code to try and understand a bit better. For context, I am trying to understand what a job costs, and what usage for an account over a span of say a month costs. Any insight is most appreciated! -- David Rhey --- Advanced Res

Re: [slurm-users] Priority access for a group of users

2019-03-01 Thread david baker
or run from the current state (needing check pointing)? Best regards, David On Tue, Feb 19, 2019 at 2:15 PM Prentice Bisbal wrote: > I just set this up a couple of weeks ago myself. Creating two partitions > is definitely the way to go. I created one partition, "general" for no

Re: [slurm-users] Priority access for a group of users

2019-03-04 Thread david baker
colleague's job and stays in pending status. Does anyone understand what might be wrong, please? Best regards, David On Fri, Mar 1, 2019 at 2:47 PM Antony Cleave wrote: > I have always assumed that cancel just kills the job whereas requeue will > cancel and then start from the beg

[slurm-users] How do I impose a limit the memory requested by a job?

2019-03-12 Thread David Baker
I can impose a memory limit on the jobs that are submitted to this partition. It doesn't make any sense to request more than the total usable memory on the nodes. So could anyone please advise me how to ensure that users cannot request more than the usable memory on the nodes. Best regar

Re: [slurm-users] How do I impose a limit the memory requested by a job?

2019-03-14 Thread david baker
Hello Paul, Thank you for your advice. That all makes sense. We're running diskless compute nodes and so the usable memory is less than the total memory. So I have added a memory check to my job_submit.lua -- see below. I think that all makes sense. Best regards, David -- Check memory/no
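The check described in this message can be sketched as follows (Python is used purely for illustration; the real plugin is written in Lua, and the usable-memory figure below is hypothetical):

```python
# Illustration of the memory check described above: reject jobs that
# request more memory than a diskless node actually has available.
# USABLE_MB is a made-up figure; the real value depends on the OS image
# held in RAM on the diskless nodes.
USABLE_MB = 190_000  # assumed usable RAM (MB) on a 192 GB diskless node

def memory_request_ok(requested_mb: int) -> bool:
    """Return True if the requested memory fits within usable node RAM."""
    return requested_mb <= USABLE_MB
```

In an actual job_submit.lua the equivalent test would typically inspect job_desc.pn_min_memory and return an error code to reject the job.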

[slurm-users] Very large job getting starved out

2019-03-21 Thread David Baker
think that the PriorityDecayHalfLife was quite high at 14 days and so I reduced that to 7 days. For reference I've included the key scheduling settings from the cluster below. Does anyone have any thoughts, please? Best regards, David PriorityDecayHalfLife = 7-00:00:00 PriorityCalcPe
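For context, PriorityDecayHalfLife controls how quickly recorded usage stops counting against a user's fairshare: after each half-life period, past usage counts for half as much. A rough sketch with hypothetical numbers:

```python
# Sketch of half-life decay applied to historical usage: halving the
# half-life (14 days -> 7 days) makes old usage fade twice as fast.
def decayed_usage(usage: float, days_elapsed: float,
                  half_life_days: float = 7.0) -> float:
    """Effective usage after exponential half-life decay."""
    return usage * 0.5 ** (days_elapsed / half_life_days)
```

With a 7-day half-life, 1000 CPU-hours of usage counts as 500 after one week and 250 after two.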

Re: [slurm-users] Very large job getting starved out

2019-03-21 Thread David Baker
me. If you or anyone else has any relevant thoughts then please let me know. In particular I am keen to understand "assoc_limit_stop" and whether it is a relevant option in this situation. Best regards, David From: slurm-users on behalf of Cyrus Pro

Re: [slurm-users] Very large job getting starved out

2019-03-22 Thread David Baker
(Resources) Best regards, David From: slurm-users on behalf of Christopher Samuel Sent: 21 March 2019 17:54 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Very large job getting starved out On 3/21/19 6:55 AM, David Baker wrote: > it current

[slurm-users] Backfill advice

2019-03-23 Thread david baker
lt bf frequency -- should we really reduce the frequency and potentially reduce the number of bf jobs per group/user or total at each iteration? Currently, I think we are setting the per/user limit to 20. Any thoughts would be appreciated, please. Best regards, David

Re: [slurm-users] Backfill advice

2019-03-25 Thread David Baker
bf_ignore_newly_avail_nodes. I was interested to see that you had a similar discussion with SchedMD and did upgrade. I think I ought to update the bf configuration re my first paragraph and see how that goes before we bite the bullet and do the upgrade (we are at 18.08.0

[slurm-users] Slurm users meeting 2019?

2019-03-25 Thread david baker
you know what’s planned this year. Best regards, David Sent from my iPad

Re: [slurm-users] Slurm users meeting 2019?

2019-03-27 Thread David Baker
Thank you for the date and location of the this year's Slurm User Group Meeting. Best regards, David From: slurm-users on behalf of Jacob Jenson Sent: 25 March 2019 21:26:45 To: Slurm User Community List Subject: Re: [slurm-users] Slurm users meeting

[slurm-users] Effect of PriorityMaxAge on job throughput

2019-04-09 Thread David Baker
MaxAge" to 7-0 to 1-0. Before that change the larger jobs could hang around in the queue for days. Does it make sense therefore to further reduce PriorityMaxAge to less than 1 day? Your advice would be appreciated, please. Best regards, David
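As background, the age component of multifactor priority grows with queue wait time and saturates once a job has waited PriorityMaxAge; a sketch of that saturation:

```python
# Sketch of the multifactor age factor: it rises linearly with queue
# wait time and caps at 1.0 once the job has waited PriorityMaxAge.
def age_factor(days_queued: float, priority_max_age_days: float) -> float:
    """Normalized age factor in [0, 1]."""
    return min(days_queued / priority_max_age_days, 1.0)
```

So shrinking PriorityMaxAge from 7-0 to 1-0 means a job reaches its maximum age priority after one day of waiting instead of seven.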

Re: [slurm-users] Effect of PriorityMaxAge on job throughput

2019-04-10 Thread David Baker
rs, please? I've attached a copy of the slurm.conf just in case you or anyone else wants to take a more complete overview. Best regards, David From: slurm-users on behalf of Michael Gutteridge Sent: 09 April 2019 18:59 To: Slurm User Community List Subjec

Re: [slurm-users] Effect of PriorityMaxAge on job throughput

2019-04-24 Thread David Baker
Hello Michael, Thank you for your email and apologies for my tardy response. I'm still sorting out my mailbox after an Easter break. I've taken your comments on board and I'll see how I go with your suggestions. Best regards, David From: slurm-u

[slurm-users] Slurm database failure messages

2019-05-07 Thread David Baker
of failures. For example -- see below. Does anyone understand what might be going wrong, why and whether we should be concerned, please? I understand that slurm databases can get quite large relatively quickly and so I wonder if this is memory related. Best regards, David [root@blue51 slurm

[slurm-users] Partition QOS limits not being applied

2019-05-09 Thread David Carlson
Hi SLURM users, I work on a cluster, and we recently transitioned to using SLURM on some of our nodes. However, we're currently having some difficulty limiting the number of jobs that a user can run simultaneously in particular partitions. Here are the steps we've taken: 1. Created a new QOS a

[slurm-users] Testing/evaluating new versions of slurm (19.05 in this case)

2019-05-16 Thread David Baker
l job data, however that simulator is based on an old version of slurm and (to be honest) it's slightly unreliable for serious study. It's certainly only useful for broad brush analysis, at the most. Please let me have your thoughts -- they would be appreciated. Best regards, David

[slurm-users] Updating slurm priority flags

2019-05-18 Thread david baker
e "dynamics" of existing and new jobs in the cluster? That is, I don't want existing jobs to lose out cf new jobs re overall priority. Your advice would be appreciated, please. Best regards, David

[slurm-users] Advice on setting up fairshare

2019-06-06 Thread David Baker
0.008264 1357382 0.88 hydrology da1g18 10.33 0 0.00 0.876289 Does that all make sense or am I missing something? I am, by the way, using the line PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE in my slurm.conf. Best regards, David

Re: [slurm-users] Advice on setting up fairshare

2019-06-07 Thread David Baker
(and eternally idle) users receive a fairshare of 1 as expected. It certainly makes the scripts/admin a great deal less cumbersome. Best regards, David From: slurm-users on behalf of Loris Bennett Sent: 07 June 2019 07:11:36 To: Slurm User Community List

[slurm-users] Deadlocks in slurmdbd logs

2019-06-19 Thread David Baker
that version is a bit more mature), however that may not be the case. Best regards, David [2019-06-19T00:00:02.728] error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction insert into "i5_assoc_usage_hour_table" . [2019-06-19T00:00:

[slurm-users] Requirement to run longer jobs

2019-07-03 Thread David Baker
ircumstances? I would be interested in your thoughts, please. Best regards, David

Re: [slurm-users] Requirement to run longer jobs

2019-07-05 Thread David Baker
Hello, Thank you to everyone who replied to my email. I'll need to experiment and see how I get on. Best regards, David From: slurm-users on behalf of Loris Bennett Sent: 04 July 2019 06:53 To: Slurm User Community List Subject: Re: [slurm-

Re: [slurm-users] Invalid qos specification

2019-07-15 Thread David Rhey
n error: > > $ salloc -p general -q debug -t 00:30:00 > salloc: error: Job submit/allocate failed: Invalid qos specification > > I'm sure I'm overlooking something obvious. Any idea what that may be? > I'm using slurm 18.08.8 on the slurm controller, and the clients

Re: [slurm-users] Cluster-wide GPU Per User limit

2019-07-17 Thread David Rhey
Unfortunately, I think you're stuck in setting it at the account level with sacctmgr. You could also set that limit as part of a QoS and then attach the QoS to the partition. But I think that's as granular as you can get for limiting TRES'. HTH! David On Wed, Jul 17, 2019 a

Re: [slurm-users] No error/output/run

2019-07-24 Thread David Rhey
ob 1277 > $ squeue > JOBID PARTITION NAME USER ST TIME NODES > NODELIST(REASON) > $ ls > in.lj slurm_script.sh > $ > > > What does that mean? > > Regards, > Mahmood > > > -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan

[slurm-users] Slurm node weights

2019-07-25 Thread David Baker
Hello, I'm experimenting with node weights and I'm very puzzled by what I see. Looking at the documentation I gathered that jobs will be allocated to the nodes with the lowest weight which satisfies their requirements. I have 3 nodes in a partition and I have defined the nodes like so.. Node
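The documented behaviour can be sketched as a selection among eligible nodes (node names, weights, and memory figures below are hypothetical):

```python
# Sketch of weight-based node selection: among nodes that satisfy the
# job's requirements, the node with the lowest Weight is preferred.
def pick_node(nodes: list[dict], needed_mem_mb: int) -> str:
    """Return the name of the lowest-weight node that fits the request."""
    candidates = [n for n in nodes if n["mem_mb"] >= needed_mem_mb]
    return min(candidates, key=lambda n: n["weight"])["name"]

# Hypothetical partition: small nodes weighted low, a big node high,
# so small jobs land on small nodes and leave the big node free.
NODES = [
    {"name": "small1", "mem_mb": 192_000, "weight": 10},
    {"name": "big1",   "mem_mb": 768_000, "weight": 100},
]
```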

Re: [slurm-users] Slurm node weights

2019-07-25 Thread David Baker
Hello, As an update I note that I have tried restarting the slurmctld, however that doesn't help. Best regards, David From: slurm-users on behalf of David Baker Sent: 25 July 2019 11:47:35 To: slurm-users@lists.schedmd.com Subject: [slurm-users]

Re: [slurm-users] Slurm node weights

2019-07-25 Thread David Baker
anyone know if there any fix or alternative strategy that might help us to achieve the same result? Best regards, David From: slurm-users on behalf of Sarlo, Jeffrey S Sent: 25 July 2019 12:26 To: Slurm User Community List Subject: Re: [slurm-users] Slu

Re: [slurm-users] Slurm node weights

2019-07-25 Thread David Baker
the system to be at risk. Or alternatively, do we need to arrange downtime, etc? Best regards, David From: slurm-users on behalf of Sarlo, Jeffrey S Sent: 25 July 2019 13:04 To: Slurm User Community List Subject: Re: [slurm-users] Slurm node weights Th

[slurm-users] Slurm statesave directory -- location and management

2019-08-28 Thread David Baker
stored in the slurm database? In other words if you lose the statesave data or it gets corrupted then you will lose all running/queued jobs? Any advice on the management and location of the statesave directory in a dual controller system would be appreciated, please. Best regards, David

[slurm-users] oddity with users showing in sacctmgr and sreport

2019-09-12 Thread David Rhey
they aren't a part of the root hierarchy in sacctmgr. We're using 18.08.7. Thanks! -- David Rhey --- Advanced Research Computing - Technology Services University of Michigan

Re: [slurm-users] Maxjobs not being enforced

2019-09-17 Thread David Rhey
Hi, Tina, Could you send the command you ran? David On Tue, Sep 17, 2019 at 2:06 PM Tina Fora wrote: > Hello Slurm user, > > We have 'AccountingStorageEnforce=limits,qos' set in our slurm.conf. I've > added maxjobs=100 for a particular user causing havoc on our sha

Re: [slurm-users] Maxjobs not being enforced

2019-09-18 Thread David Rhey
Hi, Tina, Are you able to confirm whether or not you can view the limit for the user in scontrol as well? David On Tue, Sep 17, 2019 at 4:42 PM Tina Fora wrote: > > # sacctmgr modify user lif6 set maxjobs=100 > > # sacctmgr list assoc user=lif6 format=user,maxjobs,maxsubmit

[slurm-users] Advice on setting a partition QOS

2019-09-25 Thread David Baker
is set to cpus/user=1280, nodes/user=32. It's almost like the 32 cpus in the serial queue are being counted as nodes -- as per the pending reason. Could someone please help me understand this issue and how to avoid it? Best regards, David

Re: [slurm-users] Advice on setting a partition QOS

2019-09-25 Thread David Baker
simply in terms of cpu/user usage? That is, not cpus/user and nodes/user. Best regards, David From: slurm-users on behalf of Juergen Salk Sent: 25 September 2019 14:52 To: Slurm User Community List Subject: Re: [slurm-users] Advice on setting a partition QOS

[slurm-users] How to modify the normal QOS

2019-09-26 Thread David Baker
in this case I'm not sure if I can delete the normal QOS on a running cluster. I have tried commands like the following to no avail.. sacctmgr update qos normal set maxtresperuser=cpu=1280 Could anyone please help with this. Best regards, David

Re: [slurm-users] How to modify the normal QOS

2019-09-26 Thread David Baker
Dear Jurgen, Thank you for that. That does the expected job. It looks like the weirdness that I saw in the serial partition has now gone away and so that is good. Best regards, David From: slurm-users on behalf of Juergen Salk Sent: 26 September 2019 16:18 To

[slurm-users] Slurm very rarely assigned an estimated start time to a job

2019-10-02 Thread David Baker
tions/tips/tricks to make sure that slurm provides estimates? Any advice would be appreciated, please. Best regards, David

Re: [slurm-users] Does Slurm store "time in current state" values anywhere ?

2019-10-03 Thread David Rhey
Hi, What about scontrol show job to see various things like: SubmitTime, EligibleTime, AccrueTime etc? David On Thu, Oct 3, 2019 at 4:53 AM Kevin Buckley wrote: > Hi there, > > we're hoping to overcome an issue where some of our users are keen > on writing their own meta-

Re: [slurm-users] Slurm very rarely assigned an estimated start time to a job

2019-10-03 Thread David Rhey
We've been working to tune our backfill scheduler here. Here is a presentation some of you might have seen at a previous SLUG on tuning the backfill scheduler. HTH! https://slurm.schedmd.com/SUG14/sched_tutorial.pdf David On Wed, Oct 2, 2019 at 1:37 PM Mark Hahn wrote: > >(most li

[slurm-users] Running job using our serial queue

2019-11-04 Thread David Baker
ion between jobs (sometimes jobs can get stalled) is due to context switching at the kernel level, however (apart from educating users) how can we minimise that switching on the serial nodes? Best regards, David

Re: [slurm-users] Running job using our serial queue

2019-11-05 Thread David Baker
r compute nodes? Does that help? Whenever I check which processes are not being constrained by cgroups I only ever find a small group of system processes. Best regards, David From: slurm-users on behalf of Marcus Wagner Sent: 05 November 2019 07:47

Re: [slurm-users] Running job using our serial queue

2019-11-07 Thread David Baker
ry is configured as a resource on these shared nodes and users should take care to request sufficient memory for their job. More often than not I guess that users are wrongly assuming that the default memory allocation is sufficient. Best regards, David From: Marcus W

[slurm-users] oom-kill events for no good reason

2019-11-07 Thread David Baker
he point does anyone understand this behaviour and know how to squash it, please? Best regards, David [2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337 [2019-11-07T16:14:52.559] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB memsw.limit=unli

Re: [slurm-users] oom-kill events for no good reason

2019-11-12 Thread David Baker
Hello, Thank you all for your useful replies. I double checked that the oom-killer "fires" at the end of every job on our cluster. As you mention this isn't significant and not something to be concerned about. Best regards, David From: slurm-user

[slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread David Baker
above the other jobs in the cluster. Best regards, David

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread David Baker
tem. The larger jobs at the expense of the small fry for example, however that is a difficult decision that means that someone has got to wait longer for results.. Best regards, David From: slurm-users on behalf of Renfro, Michael Sent: 31 January 2020 13:27 To:

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread David Baker
being freed up in the cluster to make way for high priority work which again concerns me. If you could please share your backfill configuration then that would be appreciated, please. Finally, which version of Slurm are you running? We are using an early release of v18. Best regards, David

Re: [slurm-users] Longer queuing times for larger jobs

2020-02-04 Thread David Baker
Hello, Thank you very much again for your comments and the details of your slurm configuration. All the information is really useful. We are working on our cluster right now and making some appropriate changes. We'll see how we get on over the next 24 hours or so. Best regards,

Re: [slurm-users] Longer queuing times for larger jobs

2020-02-04 Thread David Baker
de job. I see very few jobs allocated by the scheduler. That is, messages like sched: Allocate JobId=296915 are few and far between and I never see any of the large jobs being allocated in the batch queue. Surely, this is not correct, however does anyone have any advice on what to check,

[slurm-users] Advice on using GrpTRESRunMin=cpu=

2020-02-12 Thread David Baker
nt in the config? We hoped that the queued jobs would not accrue priority. We haven't, for example, used "accrue always". Have I got that wrong? Could someone please advise us. Best regards, David [root@navy51 slurm]# sprio JOBID PARTITION PRIORITY SITEAGE
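For reference, GrpTRESRunMin=cpu= limits the combined cpu-minutes still to be consumed by an association's running jobs (each job's contribution shrinks as it runs down its remaining wall time); a sketch with made-up jobs:

```python
# Sketch of GrpTRESRunMin=cpu accounting: each running job contributes
# (allocated CPUs) x (remaining wall-time minutes), so long wide jobs
# consume the group's budget fastest. Job figures are hypothetical.
def grp_tres_run_min_cpu(running_jobs: list[dict]) -> int:
    """Total outstanding cpu-minutes across an association's running jobs."""
    return sum(j["cpus"] * j["remaining_min"] for j in running_jobs)
```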

Re: [slurm-users] Job with srun is still RUNNING after node reboot

2020-03-31 Thread David Rhey
Hi, Yair, Out of curiosity have you checked to see if this is a runaway job? David On Tue, Mar 31, 2020 at 7:49 AM Yair Yarom wrote: > Hi, > > We have an issue where running srun (with --pty zsh), and rebooting the > node (from a different shell), the srun reports: &

Re: [slurm-users] Drain a single user's jobs

2020-04-01 Thread David Rhey
ut no new work to be submitted. HTH, David On Wed, Apr 1, 2020 at 5:57 AM Mark Dixon wrote: > Hi all, > > I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster. > > I'd like to stop user foo from submitting new jobs but allow their > existing jobs to ru

[slurm-users] Slurm unlink error messages -- what do they mean?

2020-04-23 Thread David Baker
ical explanation for the message on inspection. Best regards, David

Re: [slurm-users] Job Step Resource Requests are Ignored

2020-05-06 Thread David Braun
i'm not sure I understand the problem. If you want to make sure the preamble and postamble run even if the main job doesn't run you can use '-d' from the man page -d, --dependency= Defer the start of this job until the specified dependencies have been satisfie

Re: [slurm-users] How to view GPU indices of the completed jobs?

2020-06-10 Thread David Braun
env > .debug_info/environ 2>&1 if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then echo "SAVING CUDA ENVIRONMENT" echo env |grep CUDA > .debug_info/environ_cuda 2>&1 fi You could add something like this to one of the SLURM prologs to save

[slurm-users] Nodes do not return to service after scontrol reboot

2020-06-16 Thread David Baker
th this? We are about to update the node firmware and ensuring that the nodes are returned to service following their reboot would be useful. Best regards, David

Re: [slurm-users] Nodes do not return to service after scontrol reboot

2020-06-17 Thread David Baker
Hello Chris, Thank you for your comments. The scontrol reboot command is now working as expected. Best regards, David From: slurm-users on behalf of Christopher Samuel Sent: 16 June 2020 18:16 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users

[slurm-users] Slurm and shared file systems

2020-06-19 Thread David Baker
going to guess that there must be a shared file system, however it would be good if someone could please confirm this. Best regards, David

[slurm-users] Slurm -- using GPU cards with NVLINK

2020-09-10 Thread David Baker
potentially make use of memory on the paired card. Best regards, David [root@alpha51 ~]# nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV2     SYS     SYS     0,2,4,6,8,10    0
GPU1    NV2      X      SYS     SYS     0,2,4,6,8,10

Re: [slurm-users] Slurm -- using GPU cards with NVLINK

2020-09-11 Thread David Baker
Hi Ryan, Thank you very much for your reply. That is useful. We'll see how we get on. Best regards, David From: slurm-users on behalf of Ryan Novosielski Sent: 11 September 2020 00:08 To: Slurm User Community List Subject: Re: [slurm-users] Slurm -- usin

[slurm-users] Accounts and QOS settings

2020-10-01 Thread David Baker
partition. My thought was to have two overlapping partitions each with the relevant QOS and account group access control. Perhaps I am making this too complicated. I would appreciate your advice, please. Best regards, David

[slurm-users] Controlling access to idle nodes

2020-10-06 Thread David Baker
like a two-way scavenger situation. Could anyone please help? I have, by the way, set up partition-based pre-emption in the cluster. This allows the general public to scavenge nodes owned by research groups. Best regards, David

[slurm-users] unable to run on all the logical cores

2020-10-07 Thread David Bellot
why TRES=cpu=2 Any idea on how to solve this problem and have 100% of the logical cores allocated? Best regards, David

Re: [slurm-users] unable to run on all the logical cores

2020-10-07 Thread David Bellot
chtools in this case) the jobs. I'm still investigating even if NumCPUs=1 now as it should be. Thanks. David On Thu, Oct 8, 2020 at 4:40 PM Rodrigo Santibáñez < rsantibanez.uch...@gmail.com> wrote: > Hi David, > > I had the same problem time ago when configuring my f

Re: [slurm-users] Controlling access to idle nodes

2020-10-08 Thread David Baker
Thank you very much for your comments. Oddly enough, I came up with the 3-partition model as well once I'd sent my email. So, your comments helped to confirm that I was thinking on the right lines. Best regards, David From: slurm-users on behalf of Thom

Re: [slurm-users] unable to run on all the logical cores

2020-10-11 Thread David Bellot
result, or should I rather launch 20 jobs per node and have each job split in two internally (using "parallel" or "future" for example)? On Thu, Oct 8, 2020 at 6:32 PM William Brown wrote: > R is single threaded. > > On Thu, 8 Oct 2020, 07:44 Diego Zuccato, wrot

[slurm-users] ninja and cmake

2020-11-24 Thread David Bellot
e and distcc exist and I use them, but here I want to test if it's possible to do it with Slurm (as a proof of concept). Cheers, David

[slurm-users] Backfill pushing jobs back

2020-12-09 Thread David Baker
ect behaviour? It is also weird that the pending jobs don't have a start time. I have increased the backfill parameters significantly, but it doesn't seem to affect this at all. SchedulerParameters=bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60 Best regards, David

Re: [slurm-users] Backfill pushing jobs back

2020-12-10 Thread David Baker
st regards, David From: slurm-users on behalf of Chris Samuel Sent: 09 December 2020 16:37 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Backfill pushing jobs back CAUTION: This e-mail originated outside the University of Southampton. Hi David, On

Re: [slurm-users] Backfill pushing jobs back

2020-12-21 Thread David Baker
e any parameter that we need to set to activate the backfill patch, for example? Best regards, David From: slurm-users on behalf of Chris Samuel Sent: 09 December 2020 16:37 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Backfill pushing jobs back CA

[slurm-users] Backfill pushing jobs back

2021-01-04 Thread David Baker
recent version of slurm would still have a backfill issue that starves larger job out. We're wondering if you have forgotten to configure something very fundamental, for example. Best regards, David

[slurm-users] Validating SLURM sreport cluster utilization report

2021-01-22 Thread David Simpson
ems with 3 nodes. So at the moment off the top of the head we don't understand this reported Down time. Is anyone else relying on sreport for this metric? If so have you encountered this sort of situation? regards David - David Simpson - Senior Systems Engineer ARCCA, Redwood

Re: [slurm-users] Validating SLURM sreport cluster utilization report

2021-01-29 Thread David Simpson
Out of interest (for those that do record and/or report on uptime) if you aren't using the sreport cluster utilization report what alternative method are you using instead? If you are using sreport cluster utilization report have you encountered this? thanks David - David Si

[slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?

2021-02-05 Thread Chin,David
;s".) Is there something I am missing? Thanks, Dave Chin -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode Drexel Internal Data

[slurm-users] Unsetting a QOS Flag?

2021-02-08 Thread Chin,David
ags=DenyOnLimit", and "sacctmgr modify qos foo set Flags=NoDenyOnLimit", to no avail. Thanks in advance, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusm

Re: [slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?

2021-02-09 Thread Chin,David
Steps Suspend Usage This generated various usage dump files, and the job_table and step_table dumps. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki

[slurm-users] sreport cluster AccountUtilizationByUser showing utilization of a deleted account

2021-02-09 Thread Chin,David
er the urcfadm account. Is there a way to fix this without just purging all the data? If there is no "graceful" fix, is there a way I can "reset" the slurm_acct_db, i.e. actually purge all data in all tables? Thanks in advance, Dave -- David Chin, PhD

Re: [slurm-users] prolog not passing env var to job

2021-03-03 Thread Chin,David
shell on the compute node does not have the env variables set. I use the same prolog script as TaskProlog, which sets it properly for jobs submitted with sbatch. Thanks in advance, Dave Chin -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.57

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
62 dwc62 6 Mar 4 11:52 /local/scratch/80472/ node001::~$ exit So, the "echo" and "whoami" statements are executed by the prolog script, as expected, but the mkdir commands are not? Thanks, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
creating the directory in (chmod 1777 for the parent directory is good) Brian Andrus On 3/4/2021 9:03 AM, Chin,David wrote: Hi, Brian: So, this is my SrunProlog script -- I want a job-specific tmp dir, which makes for easy cleanup at end of job: #!/bin/bash if [[ -z ${SLURM_ARRAY_JOB

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
My mistake - from slurm.conf(5): SrunProlog runs on the node where the "srun" is executing. i.e. the login node, which explains why the directory is not being created on the compute node, while the echos work. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@

[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
m=0,node=1 83387.extern extern node001 03:34:26 COMPLETED 0:0 128Gn 460K 153196K billing=16,cpu=16,node=1 Thanks in advance, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 21

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
0 CPU Efficiency: 11.96% of 2-09:10:56 core-walltime Job Wall-clock time: 03:34:26 Memory Utilized: 1.54 GB Memory Efficiency: 1.21% of 128.00 GB -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu
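The core-walltime figure quoted by seff is simply wall-clock time multiplied by allocated CPUs; checking the numbers above (16 CPUs, taken from the billing line earlier in the thread):

```python
# Verify seff's core-walltime: wall-clock seconds times allocated CPUs.
def core_walltime_seconds(wall_hms: str, ncpus: int) -> int:
    """Convert H:MM:SS wall time to total core-seconds for ncpus CPUs."""
    h, m, s = (int(x) for x in wall_hms.split(":"))
    return (h * 3600 + m * 60 + s) * ncpus

# 03:34:26 of wall time on 16 CPUs gives 2-09:10:56 of core-walltime,
# matching the seff report quoted above.
```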

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
t 16e9 rows in the original file. Saved output .mat file is only 1.8kB. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki git

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
One possible datapoint: on the node where the job ran, there were two slurmstepd processes running, both at 100%CPU even after the job had ended. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp
