On Monday, 25 February 2019 2:55:44 AM PST Patrice Peterson wrote:
> Filed a bug: https://bugs.schedmd.com/show_bug.cgi?id=6573
Looks like Danny fixed it in git.
https://github.com/SchedMD/slurm/commit/b1c78d9934ef461df637c57c001eb165a6b1fcc3
--
Chris Samuel : http://www.csamuel.org/ : B
On Tuesday, 26 February 2019 10:03:34 AM PST Brian Andrus wrote:
> One thing I have noticed is that the END field for jobs with a state of
> FAILED is "Unknown" but the ELAPSED field has the time it ran.
That shouldn't happen, it works fine here (and where I've used Slurm in
Australia).
$ sacct
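As a sketch (the job ID is made up), narrowing the output to the
relevant fields should show both End and Elapsed populated for a FAILED
job:

  $ sacct -j 1234 --format=JobID,State,End,Elapsed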
On Wednesday, 27 February 2019 1:08:56 PM PST Michael Gutteridge wrote:
> Yes, we do have time limits set on partitions- 7 days maximum, 3 days
> default. In this case, the larger job is requesting 3 days of walltime,
> the smaller jobs are requesting 7.
It sounds like no forward reservation is being made for the larger job.
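If the backfill scheduler is in play, its planning window needs to
cover the longest allowed time limit for that reservation to happen. As
a sketch only (the values are illustrative, not a recommendation), the
relevant slurm.conf pieces look like:

  SchedulerType=sched/backfill
  # bf_window is in minutes; 10080 = 7 days, i.e. the partition MaxTime
  SchedulerParameters=bf_window=10080,bf_continue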
On Wednesday, 27 February 2019 5:06:37 PM PST hu...@sugon.com wrote:
> I have a cluster with 9 nodes (cmbc[1530-1538]); each node has 2
> CPUs and each CPU has 32 cores, but when I submitted a heterogeneous job
> twice, the second job terminated unexpectedly.
Does this work if you use Open
Hi there,
I have a cluster with 9 nodes (cmbc[1530-1538]); each node has 2 CPUs
and each CPU has 32 cores, but when I submitted a heterogeneous job
twice, the second job terminated unexpectedly.
This problem has been bothering me all day. The Slurm version is 18.08.5
and here is the job:
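For reference only, and purely as an illustration of the syntax rather
than the job from this report (the resource counts are placeholders), a
two-component heterogeneous submission under 18.08 looks something like:

  $ srun -N1 -n2 --mem-per-cpu=1G : -N2 -n8 --mem-per-cpu=2G hostname

with the components separated by ":".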
I am not very familiar with the Slurm power saving stuff. You might want
to look at the BatchStartTimeout parameter (see e.g.
https://slurm.schedmd.com/power_save.html).
Otherwise, what state are the power-saving nodes in when they are powered
down? From the man pages it sounds like they should be idle.
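As a sketch only (the program paths are placeholders for site-specific
scripts), the slurm.conf pieces involved look like:

  SuspendTime=600                            # seconds idle before power-down
  SuspendProgram=/opt/site/node_suspend.sh   # placeholder path
  ResumeProgram=/opt/site/node_resume.sh     # placeholder path
  ResumeTimeout=300                          # seconds allowed for a node to come back
  BatchStartTimeout=300                      # allow for boot time before the batch script starts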
> You have not provided enough information (cluster configuration, job
> information, etc) to diagnose what accounting policy is being violated.
Yeah, sorry. I'm trying to balance the amount of information and probably
erred on the side of being too concise 8-/
The partition looks like:
PartitionName=largenode
Allo
Yes, we do have time limits set on partitions- 7 days maximum, 3 days
default. In this case, the larger job is requesting 3 days of walltime,
the smaller jobs are requesting 7.
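For reference, limits like that come from a partition definition along
these lines (the node list is a placeholder; only the partition name and
the time limits are from this thread):

  PartitionName=largenode Nodes=node[01-16] MaxTime=7-00:00:00 DefaultTime=3-00:00:00 State=UP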
Thanks
M
On Wed, Feb 27, 2019 at 12:41 PM Andy Riebs wrote:
> Michael, are you setting time limits for the jobs? That's a huge part of
> a scheduler's decision about whether another job can be run.
The "JobId=2210784 delayed for accounting policy is likely the key as it
indicates the job is currently unable to run, so the lower priority smaller
job bumps ahead of it.
You have not provided enough information (cluster configuration, job
information, etc) to diagnose what accounting policy is be
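To narrow down which limit is being hit, something like the following
may help (the job ID is the one from the thread, the user name is a
placeholder):

  $ scontrol show job 2210784    # look at the Reason= field
  $ sprio -j 2210784             # per-factor priority of the pending job
  $ sacctmgr show assoc where user=alice format=account,user,grpjobs,grptres,maxjobs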
Michael, are you setting time limits for the jobs? That's a huge part of
a scheduler's decision about whether another job can be run. For
example, if a job is running with the Slurm default of "infinite," the
scheduler will likely decide that jobs that will fit in the remaining
nodes will be able to run.
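For example (the script names are placeholders; the times match the
limits mentioned elsewhere in the thread), an explicit limit per job:

  $ sbatch --time=3-00:00:00 -p largenode big_job.sh
  $ sbatch --time=7-00:00:00 -p largenode small_job.sh

gives the backfill scheduler something concrete to plan around.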
I've run into a problem with a cluster we've got in a cloud provider;
hoping someone might have some advice.
The problem is that I've got a circumstance where large jobs _never_
start... or more correctly, that larger jobs don't start when there are
many smaller jobs in the partition. In this c
Hi
I don't know what version of Slurm you're using or how it may be different
from the one I'm using (18.05), but here's my understanding of memory
limits and what I'm seeing on our cluster. The parameter
`JobAcctGatherParams=OverMemoryKill` controls whether a step is killed if
it goes over the requested memory.
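As a sketch of the pieces involved (the values are illustrative), the
accounting-gather side of slurm.conf looks like:

  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherFrequency=task=30       # sampling interval in seconds
  JobAcctGatherParams=OverMemoryKill   # kill steps exceeding requested memory

and the limit being enforced is whatever the job requested, e.g. via
sbatch --mem=4G or --mem-per-cpu=2G.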
I think if you increase the share of mygroup to something like 999, then the
share that the root user gets will drop by a factor of 1000.
Pretty sure I've seen this before and that's how I fixed it.
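If it helps, the account's share can be raised with something like
(account name from the thread, the value is illustrative):

  sacctmgr modify account name=mygroup set fairshare=999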
Antony
On Wed, 27 Feb 2019 at 13:47, Will Dennis wrote:
> Looking at the output of 'sshare', I see:
>
Hi Will,
as long as you do not submit a massive number of jobs as root, there
should be no problem.
This is only a priority thing, so root will have a fairly high priority;
it does not mean the users can only use half of your cluster.
Best
Marcus
On 2/27/19 2:43 PM, Will Dennis wrote:
> Looking at the output of 'sshare', I see:
Looking at the output of 'sshare', I see:
root@myserver:~# sshare -l
             Account       User  RawShares  NormShares    RawUsage   NormUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ----------
root