Re: [slurm-users] Runaway jobs issue: : Resource temporarily unavailable, slurm 17.11.3

2018-04-24 Thread Chris Samuel
On Wednesday, 25 April 2018 3:47:17 PM AEST Chris Samuel wrote: > I'll open a bug just in case.. https://bugs.schedmd.com/show_bug.cgi?id=5097 -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Re: [slurm-users] Runaway jobs issue: : Resource temporarily unavailable, slurm 17.11.3

2018-04-24 Thread Chris Samuel
On Wednesday, 25 April 2018 5:59:38 AM AEST Christopher Benjamin Coffey wrote: > #define MAX_MSG_SIZE (16*1024*1024) That is really, really strange; there are 4 different definitions of that symbol in the source code. $ git grep 'define MAX_MSG_SIZE' src/common/slurm_persist_conn.c:#define MA

Re: [slurm-users] Runaway jobs issue: : Resource temporarily unavailable, slurm 17.11.3

2018-04-24 Thread Christopher Benjamin Coffey
We've gotten around the issue where we could not remove the runaway jobs. We had to go the manual route of manipulating the db directly. We actually used a great script that Loris Bennet wrote a while back. I haven't had to use it for a long while - thanks again! :) An item of interest for the
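(For reference, a minimal sketch of that manual DB route, not the script mentioned above: it assumes a MySQL/MariaDB slurmdbd backend and the usual <cluster>_job_table layout with time_start/time_end/state columns, where state 3 is JOB_COMPLETE in Slurm's job_states enum. Verify the table and column names against your own database, stop slurmdbd, and take a backup before touching anything.)

  -- jobs the accounting DB still thinks are running (no end time recorded)
  SELECT id_job, time_start, state
    FROM <cluster>_job_table
   WHERE time_end = 0 AND time_start > 0;

  -- mark them finished so they no longer show up as runaway
  UPDATE <cluster>_job_table
     SET state = 3, time_end = time_start
   WHERE time_end = 0 AND time_start > 0;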

[slurm-users] Runaway jobs issue: : Resource temporarily unavailable, slurm 17.11.3

2018-04-24 Thread Christopher Benjamin Coffey
Hi, we have an issue currently where we have a bunch (56K) of runaway jobs, but we cannot clear them:

  sacctmgr show runaway | wc -l
  sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
  sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
  58588

Has anyone run

Re: [slurm-users] Slurm overhead

2018-04-24 Thread Ryan Novosielski
I would likely crank up the debugging on the slurmd process and look at the log files to see what’s going on in that time. You could also watch the job via top or other means (on Linux, you can press “1” to see line-by-line for each CPU core), or use strace on the process itself. Presumably some
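(A few concrete commands along those lines, assuming root on the compute node; <pid> is a placeholder for the PID of the process you want to watch.)

  # run slurmd in the foreground with extra verbosity instead of via the init system
  slurmd -D -vvvv

  # watch per-core load; press "1" inside top to expand the per-CPU lines
  top

  # trace system calls of the suspect process, following forks, with timestamps
  strace -f -tt -p <pid>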

Re: [slurm-users] Slurm overhead

2018-04-24 Thread Bill Barth
How do you start it? If you use Sys V-style startup scripts, then likely /etc/init.d/slurm stop, but if you're using systemd, then probably systemctl stop slurm.service (but I don’t do systemd). Best, Bill. Sent from my phone > On Apr 24, 2018, at 11:15 AM, Mahmood Naderan wrote: > > Hi Bi
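(Spelled out, assuming default service names, which vary between packagings; on many systems the compute-node daemon ships as its own slurmd unit.)

  # Sys V style init scripts
  /etc/init.d/slurm stop

  # systemd, depending on how the packages name the units
  systemctl stop slurm.service
  systemctl stop slurmd.service   # common name on compute nodes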

Re: [slurm-users] Include some cores of the head node to a partition

2018-04-24 Thread Mahmood Naderan
Chris, So the problem still exists ;) >Yes, if you are happy >for the asymmetry then you can do that. That is the question. MaxCPUsPerNode sets the maximum core count symmetrically for all nodes in the partition, so it is not applicable to asymmetric cases. Regards, Mahmood On Mon, Apr 23
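(To illustrate the limitation, a hypothetical slurm.conf fragment with made-up node names: MaxCPUsPerNode is a single per-partition value, so the cap below applies to every node in the partition; it cannot express "only 8 cores of the head node but all cores of the compute nodes".)

  NodeName=head        CPUs=32 State=UNKNOWN
  NodeName=node[01-04] CPUs=32 State=UNKNOWN
  PartitionName=work Nodes=head,node[01-04] MaxCPUsPerNode=8 State=UP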

Re: [slurm-users] Slurm overhead

2018-04-24 Thread Mahmood Naderan
Hi Bill, In order to shut down the slurm process on the compute node, is it fine to kill /usr/sbin/slurm? Or is there a better and safer way to do that? Regards, Mahmood On Sun, Apr 22, 2018 at 5:44 PM, Bill Barth wrote: > Mahmood, > > If you have exclusive control of this system and can afford

Re: [slurm-users] Partition 'alias'?

2018-04-24 Thread Chris Samuel
On Tuesday, 24 April 2018 5:40:22 PM AEST Diego Zuccato wrote: > I'd say they do *not* act as a single partition... Unless I missed some > key detail, once a node is assigned a job in a partition, it's > unavailable *as a whole* to other partitions. No, that's not right, we have overlapping parti
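(For example, in a hypothetical slurm.conf fragment like the one below, the same nodes appear in two partitions; Slurm tracks allocations per core, not per partition, so a job in one partition only removes the cores it was actually given from what the other partition can still schedule.)

  PartitionName=short Nodes=node[01-10] MaxTime=04:00:00   State=UP
  PartitionName=long  Nodes=node[01-10] MaxTime=7-00:00:00 State=UP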

Re: [slurm-users] Partition 'alias'?

2018-04-24 Thread Diego Zuccato
On 20/04/2018 15:56, Renfro, Michael wrote: > Not sure how to answer if they “essentially act as a single partition”, > though. Resources allocated to a job in a given partition are unavailable to > other jobs, regardless of what partition they’re in. I'd say they do *not* act as a single p