On Monday, 25 February 2019 2:55:44 AM PST Patrice Peterson wrote:
> Filed a bug: https://bugs.schedmd.com/show_bug.cgi?id=6573
Looks like Danny fixed it in git.
https://github.com/SchedMD/slurm/commit/b1c78d9934ef461df637c57c001eb165a6b1fcc3
--
Chris Samuel : http://www.csamuel.org/ : B
On Tuesday, 26 February 2019 10:03:34 AM PST Brian Andrus wrote:
> One thing I have noticed is that the END field for jobs with a state of
> FAILED is "Unknown" but the ELAPSED field has the time it ran.
That shouldn't happen, it works fine here (and where I've used Slurm in
Australia).
$ sacct
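As a sketch (the job ID is made up), narrowing the output to the
relevant fields should show both End and Elapsed populated for a FAILED
job:

  $ sacct -j 1234 --format=JobID,State,End,Elapsed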
On Wednesday, 27 February 2019 1:08:56 PM PST Michael Gutteridge wrote:
> Yes, we do have time limits set on partitions- 7 days maximum, 3 days
> default. In this case, the larger job is requesting 3 days of walltime,
> the smaller jobs are requesting 7.
It sounds like no forward reservation is being made for the larger job.
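If the backfill scheduler is in play, its planning window needs to
cover the longest allowed time limit for that reservation to happen. As
a sketch only (the values are illustrative, not a recommendation), the
relevant slurm.conf pieces look like:

  SchedulerType=sched/backfill
  # bf_window is in minutes; 10080 = 7 days, i.e. the partition MaxTime
  SchedulerParameters=bf_window=10080,bf_continue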
On Wednesday, 27 February 2019 5:06:37 PM PST hu...@sugon.com wrote:
> I have a cluster with 9 nodes (cmbc[1530-1538]); each node has 2
> CPUs and each CPU has 32 cores, but when I submitted a heterogeneous job
> twice, the second job terminated unexpectedly.
Does this work if you use Open
Hi there,
I have a cluster with 9 nodes (cmbc[1530-1538]); each node has 2 CPUs
and each CPU has 32 cores, but when I submitted a heterogeneous job
twice, the second job terminated unexpectedly.
This problem has been bothering me all day. The Slurm version is 18.08.5
and here is the job:
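For reference only, and purely as an illustration of the syntax rather
than the job from this report (the resource counts are placeholders), a
two-component heterogeneous submission under 18.08 looks something like:

  $ srun -N1 -n2 --mem-per-cpu=1G : -N2 -n8 --mem-per-cpu=2G hostname

with the components separated by ":".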
I am not very familiar with the Slurm power saving stuff. You might want
to look at the BatchStartTimeout parameter (see e.g.
https://slurm.schedmd.com/power_save.html).
Otherwise, what state are the power-saving nodes in when they are powered
down? From the man pages it sounds like they should be idle.
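As a sketch only (the program paths are placeholders for site-specific
scripts), the slurm.conf pieces involved look like:

  SuspendTime=600                            # seconds idle before power-down
  SuspendProgram=/opt/site/node_suspend.sh   # placeholder path
  ResumeProgram=/opt/site/node_resume.sh     # placeholder path
  ResumeTimeout=300                          # seconds allowed for a node to come back
  BatchStartTimeout=300                      # allow for boot time before the batch script starts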
> You have not provided enough information (cluster configuration, job
> information, etc) to diagnose what accounting policy is being violated.
Yeah, sorry. I'm trying to balance the amount of information and probably
erred on the side of being too concise 8-/
The partition looks like:
PartitionName=largenode
Allo
Yes, we do have time limits set on partitions- 7 days maximum, 3 days
default. In this case, the larger job is requesting 3 days of walltime,
the smaller jobs are requesting 7.
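For reference, limits like that come from a partition definition along
these lines (the node list is a placeholder; only the partition name and
the time limits are from this thread):

  PartitionName=largenode Nodes=node[01-16] MaxTime=7-00:00:00 DefaultTime=3-00:00:00 State=UP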
Thanks
M
On Wed, Feb 27, 2019 at 12:41 PM Andy Riebs wrote:
> Michael, are you setting time limits for the jobs? That's a huge part of
> a scheduler's decision about whether another job can be run.
The "JobId=2210784 delayed for accounting policy is likely the key as it
indicates the job is currently unable to run, so the lower priority smaller
job bumps ahead of it.
You have not provided enough information (cluster configuration, job
information, etc) to diagnose what accounting policy is be
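To narrow down which limit is being hit, something like the following
may help (the job ID is the one from the thread, the user name is a
placeholder):

  $ scontrol show job 2210784    # look at the Reason= field
  $ sprio -j 2210784             # per-factor priority of the pending job
  $ sacctmgr show assoc where user=alice format=account,user,grpjobs,grptres,maxjobs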
Michael, are you setting time limits for the jobs? That's a huge part of
a scheduler's decision about whether another job can be run. For
example, if a job is running with the Slurm default of "infinite," the
scheduler will likely decide that jobs that will fit in the remaining
nodes will be able to run.
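For example (the script names are placeholders; the times match the
limits mentioned elsewhere in the thread), an explicit limit per job:

  $ sbatch --time=3-00:00:00 -p largenode big_job.sh
  $ sbatch --time=7-00:00:00 -p largenode small_job.sh

gives the backfill scheduler something concrete to plan around.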
I've run into a problem with a cluster we've got in a cloud provider;
hoping someone might have some advice.
The problem is that I've got a circumstance where large jobs _never_
start... or more correctly, that larger jobs don't start when there are
many smaller jobs in the partition. In this c
Hi
I don't know what version of Slurm you're using or how it may be different
from the one I'm using (18.05), but here's my understanding of memory
limits and what I'm seeing on our cluster. The parameter
`JobAcctGatherParams=OverMemoryKill` controls whether a step is killed if
it goes over the requested memory.
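As a sketch of the pieces involved (the values are illustrative), the
accounting-gather side of slurm.conf looks like:

  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherFrequency=task=30       # sampling interval in seconds
  JobAcctGatherParams=OverMemoryKill   # kill steps exceeding requested memory

and the limit being enforced is whatever the job requested, e.g. via
sbatch --mem=4G or --mem-per-cpu=2G.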
I think if you increase the share of mygroup to something like 999, then the
share that the root user gets will drop by a factor of 1000.
Pretty sure I've seen this before and that's how I fixed it.
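If it helps, the account's share can be raised with something like
(account name from the thread, the value is illustrative):

  sacctmgr modify account name=mygroup set fairshare=999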
Antony
On Wed, 27 Feb 2019 at 13:47, Will Dennis wrote:
> Looking at the output of 'sshare', I see:
>
Hi Will,
as long as you do not submit a massive number of jobs as root, there
should be no problem.
This is only a priority thing, so root will have a fairly high priority;
it does not mean the users can only use half of your cluster.
Best
Marcus
On 2/27/19 2:43 PM, Will Dennis wrote:
> Looking at the output of 'sshare', I see:
Looking at the output of 'sshare', I see:
root@myserver:~# sshare -l
             Account       User  RawShares  NormShares    RawUsage   NormUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ----------
root