[slurm-users] Job not cancelled after "TimeLimit" exceeded

2020-03-10 Thread sysadmin.caos

Hi,

my SLURM cluster has a partition configured with a "TimeLimit" of 8 
hours. A job has now been running for 9h30m and has not been cancelled. 
During these nine and a half hours, a script has executed "scontrol 
update partition=mypartition state=down" to disable this partition 
(educational cluster, and student classes start at 8:00).


Why hasn't my job been cancelled? There is no log entry at the SLURM 
controller that explains this behaviour.
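
For reference, the partition's limits and the job's elapsed time can be 
checked with something like this (the job id below is just a placeholder):

$ scontrol show partition mypartition | grep -Ei 'state|maxtime'
$ squeue -j 12345 -o '%i %P %l %M'   # %l = time limit, %M = elapsed time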


Thanks.



Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread Kirill 'kkm' Katsnelson
Yes, it's odd.


 -kkm

On Mon, Mar 9, 2020 at 7:44 AM mike tie  wrote:

>
> Interesting.   I'm still confused by where slurmd -C is getting the
> data.  When I think of where the kernel stores info about the processor, I
> normally think of /proc/cpuinfo. (By the way, I am running CentOS 7 in the
> vm.  The vm hypervisor is VMware.)  /proc/cpuinfo does show 16 cores.
>

AFAIK, the topology can be queried from /sys/devices/system/node/node*/ <
https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html> and
/sys/devices/system/cpu/cpu*/topology.

Whether or not Slurm in fact gets the topology from there, I do not know.
The build has dependencies on both libhwloc and libnuma--that's a clue.
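
For example, the kernel's own view can be cross-checked directly from sysfs
(a rough sketch, assuming the usual sysfs layout):

$ ls -d /sys/devices/system/node/node*                                               # NUMA nodes
$ cat /sys/devices/system/cpu/cpu*/topology/physical_package_id | sort -u | wc -l    # sockets
$ paste <(cat /sys/devices/system/cpu/cpu*/topology/physical_package_id) \
        <(cat /sys/devices/system/cpu/cpu*/topology/core_id) | sort -u | wc -l       # physical cores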


> I understand your concern over the processor speed.  So I tried a
> different vm where I see the following specs:
>

It's not even so much its speed per se, rather the way the hypervisor has
finely chopped the 16 virtual CPUs into 4 sockets without hyperthreads. It
makes no sense at all. I have a hunch that the other VM (the one that
reports the correct CPU count) should rather put them into a single socket, at
least by default. But yeah, it does not answer the question of where the
number 10 is popping up from.


> When I increase the core count on that vm, reboot, and run slurmd -C, it too
> continues to show the lower original core count.
>

Most likely it's stored somewhere on disk.


> Specifically, how is slurmd -C getting that info?  Maybe this is a kernel
> issue, but other than lscpu and /proc/cpuinfo, I don't know where to look.
>

I would not bet 1 to 100 on a kernel bug. The number is most likely to come
from either some stray config file, or a cache on disk. I don't know if
slurmd stores any cache, never had to look (all my nodes are virtual and
created and deleted on demand, thus always start fresh), but if it does,
it's somewhere under /var/lib/slurm*.
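
If it does, something along these lines should turn up the usual suspects
(paths are guesses, adjust for how Slurm was installed):

$ grep -ri 'sockets\|corespersocket\|threadspercore\|cpus=' /etc/slurm* 2>/dev/null
$ ls -la /var/spool/slurm* /var/lib/slurm* 2>/dev/null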

I thought (possibly incorrectly) that the switch -C reports the node size
and CPU configuration without even looking at config files. I would check
first if it talks to the controller at all (tweak e.g. the port number in
slurm.conf), and, if it does, what is the current slurmctld's idea about
this node (scontrol show node=, IIRC, or something very much like
that).
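
Roughly (the node name below is just a placeholder):

$ slurmd -C                    # what this node detects locally
$ scontrol show node vmnode1   # what slurmctld currently records for it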


>   Maybe I should be looking at the slurmd source?
>

slurmd should be much simpler than slurmctld, and the -C query must be a
straightforward, very synchronous operation. But reading sources is quite
time-consuming, so I would venture into it only as a last resort. Since -C
is not forking, it should be easy to run it under gdb. YMMV, of course.
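
A minimal sketch of that, assuming debug symbols are available:

$ gdb --args slurmd -C
(gdb) run
# then re-run with breakpoints on whatever detection functions show up in the backtrace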

 -kkm


Re: [slurm-users] Job not cancelled after "TimeLimit" exceeded

2020-03-10 Thread Ole Holm Nielsen

On 3/10/20 9:03 AM, sysadmin.caos wrote:
my SLURM cluster has a partition configured with a "TimeLimit" of 8 hours. 
A job has now been running for 9h30m and has not been cancelled. During 
these nine and a half hours, a script has executed "scontrol update 
partition=mypartition state=down" to disable this partition 
(educational cluster, and student classes start at 8:00).


Why hasn't my job been cancelled? There is no log entry at the SLURM controller 
that explains this behaviour.


You may want to check the following parameter in your slurm.conf file 
(read the man-page first):


AccountingStorageEnforce: This controls what level of association-based 
enforcement to impose on job submissions.


You may want to read about EnforcePartLimits and OverTimeLimit parameters 
as well.


Display your current configuration by: scontrol show config
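
A quick way to pull out just those values from the running configuration:

$ scontrol show config | grep -E 'AccountingStorageEnforce|EnforcePartLimits|OverTimeLimit'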

/Ole





Re: [slurm-users] update node config while jobs are running

2020-03-10 Thread Andy Georges
Hi,

On Tue, Mar 10, 2020 at 05:49:07AM +, Rundall, Jacob D wrote:
> I need to update the configuration for the nodes in a cluster and I’d like to 
> let jobs keep running while I do so. Specifically, I need to add 
> RealMemory= to the node definitions (NodeName=). Is it safe to do this 
> for nodes where jobs are currently running? Or do I need to make sure nodes are 
> drained while updating their config? We are using SelectType=select/linear on 
> this cluster. Users would only be allocating complete nodes.
> 
> Additionally, do I need to restart the Slurm daemons (slurmctld and slurmd) 
> to make this change? I understand if I were adding completely new nodes I 
> would need to do so (and that it’s advised to stop slurmctld, update config 
> files, restart slurmd on all computes, and then start slurmctld). But is 
> restarting the Slurm daemons also required when updating node config as I 
> would like to do, or would ‘scontrol reconfigure’ suffice?

If you want the change to be persistent, you will need to update
slurm.conf (and/or other files in /etc/slurm). 

That said, scontrol reconfigure should suffice to trigger the change in the
running slurmd daemons. However, restarting slurmd and slurmctld is no big
deal AFAIK, provided you respect the timeouts you have set. When
restarting slurmd, it will see the running jobs. When restarting
slurmctld, it will poll the nodes for info and regain knowledge of what is
running. So it is no issue to do this live. I would first restart
slurmctld and then all the slurmds (after slurmctld is back up and
running properly).
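
As a sketch of that non-disruptive route (node names and the memory figure
below are placeholders):

# in slurm.conf on the controller and all nodes, e.g.:
#   NodeName=node[01-16] CPUs=16 RealMemory=64000 State=UNKNOWN
$ scontrol reconfigure
$ scontrol show node node01 | grep -o 'RealMemory=[0-9]*'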

Regards,
-- Andy




Re: [slurm-users] Job not cancelled after "TimeLimit" exceeded

2020-03-10 Thread Gestió Servidors
Hello,

I have checked my configuration with "scontrol show config" and these are the 
values of those three parameters:
AccountingStorageEnforce = none
EnforcePartLimits   = NO
OverTimeLimit   = 500 min

...so now I understand why my job hasn't been cancelled after 8 hours: 
OverTimeLimit grants an extra 500 minutes on top of the partition's limit 
(8h + 500 min is roughly 16h20m), so a 9h30m job is still within the grace period.

Thanks.

> --
> 
> Message: 2
> Date: Tue, 10 Mar 2020 11:25:08 +0100
> From: Ole Holm Nielsen 
> To: 
> Subject: Re: [slurm-users] Job not cancelled after "TimeLimit" exceeded
> 
> On 3/10/20 9:03 AM, sysadmin.caos wrote:
> > my SLURM cluster has a partition configured with a "TimeLimit" of 8 hours.
> > A job has now been running for 9h30m and has not been cancelled.
> > During these nine and a half hours, a script has executed "scontrol
> > update partition=mypartition state=down" to disable this partition
> > (educational cluster, and student classes start at 8:00).
> >
> > Why hasn't my job been cancelled? There is no log entry at the SLURM
> > controller that explains this behaviour.
> 
> You may want to check the following parameter in your slurm.conf file (read
> the man-page first):
> 
> AccountingStorageEnforce: This controls what level of association-based
> enforcement to impose on job submissions.
> 
> You may want to read about EnforcePartLimits and OverTimeLimit
> parameters as well.
> 
> Display your current configuration by: scontrol show config
> 
> /Ole
> 
> 



Re: [slurm-users] srun --reboot option is not working

2020-03-10 Thread Brian Andrus
I built/ran a quick test on older slurm and do see the issue. Looks like 
a possible bug. I would open a bug with SchedMD.


I couldn't think of a good work-around, since the job would get 
rescheduled to a different node if you reboot, even if you have the node 
update its own status at boot. It could probably be worked around, but 
not in a simple way. Easier to upgrade to the newest release :)
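
For what it's worth, the "node updates its own status at boot" part would be 
roughly a boot-time unit on the node (systemd unit, @reboot cron entry, or 
similar) running something like the following; entirely untested in this scenario:

$ scontrol update nodename=$(hostname -s) state=resume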


Brian Andrus

On 3/9/2020 10:14 AM, MrBr @ GMail wrote:

Hi Brian
The nodes work with slurm without any issues till I try the "--reboot" 
option.

I can successfully allocate the nodes or any other slurm related operation

> You may want to double check that the node is actually rebooting and
> that slurmd is set to start on boot.
That's the problem: they are not being rebooted. I'm monitoring the nodes.

sinfo from the nodes works without issue before and after using "--reboot",
and slurmd is up.


On Mon, Mar 9, 2020 at 5:59 PM Brian Andrus wrote:


You may want to double check that the node is actually rebooting and
that slurmd is set to start on boot.

ResumeTimeoutReached, in a nutshell, means slurmd isn't talking to
slurmctld.
Are you able to log onto the node itself and see that it has rebooted?
If so, try doing something like 'sinfo' from the node and verify it is
able to talk to slurmctld from the node, and verify slurmd started
successfully.

Brian Andrus

On 3/9/2020 4:38 AM, MrBr @ GMail wrote:
> Hi all
>
> I'm trying to use the --reboot option of srun to reboot the nodes
> before allocation.
> However, the nodes are not being rebooted.
>
> The nodes get stuck in the allocated# state as shown by sinfo, or CF as
> shown by squeue.
> The logs of slurmctld and slurmd show no relevant information, with
> debug levels at "debug5".
> Eventually the nodes went to "down" due to "ResumeTimeout reached".
>
> Strangest thing is that "scontrol reboot " works without
> any issues.
> AFAIK both commands rely on the same RebootProgram.
>
> In the srun documentation there is the following statement: "This is only
> supported with some system configurations and will otherwise be
> silently ignored". Maybe I have this "non-supported" configuration?
>
> Does anyone have a suggestion regarding the root cause of this behavior, or
> a possible investigation path?
>
> Tech data:
> Slurm 19.05
> The user that executes srun is an admin, although that is not
> required in 19.05



[slurm-users] Diminishing the priority of an account

2020-03-10 Thread Jason Macklin
Hi,

We are trying to set up accounts by user group, and I have one group whose 
priority I'd like to drop from the default FairShare value of 1. I'm assuming 
that this is accomplished with the sacctmgr command, but I haven't been able to 
figure out the exact syntax. Assuming this is the correct mechanism, what might 
this command look like?
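
For reference, I imagine the general shape is something like the following, 
though I have not confirmed the attribute names ("lowpri" is a made-up account 
and the share value is a placeholder; shares are relative across sibling accounts):

$ sacctmgr modify account where name=lowpri set fairshare=1
$ sshare -a | grep lowpri   # check the resulting share tree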

Thank you!

Jason Macklin
HPC Systems Engineer, The Jackson Laboratory
Bar Harbor, ME | Farmington, CT | Sacramento, CA
860.837.2142 t | 860.202.7779 m
jason.mack...@jax.org
www.jax.org
The Jackson Laboratory: Leading the search for tomorrow's cures



Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread mike tie
Here is the output of lstopo

$ lstopo -p

Machine (63GB)

  Package P#0 + L3 (16MB)

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#3

  Package P#1 + L3 (16MB)

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#4

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#5

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#6

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#7

  Package P#2 + L3 (16MB)

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#8

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#9

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#10

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#11

  Package P#3 + L3 (16MB)

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#12

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#13

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#14

L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#15

  HostBridge P#0

PCI 8086:7010

  Block(Removable Media Device) "sr0"

PCI 1234:

  GPU "card0"

  GPU "controlD64"

PCI 1af4:1004

PCI 1af4:1000




Michael Tie
Technical Director
Mathematics, Statistics, and Computer Science

 One North College Street   phn: 507-222-4067
 Northfield, MN 55057       cel: 952-212-8933
 m...@carleton.edu          fax: 507-222-4312


On Tue, Mar 10, 2020 at 12:21 AM Chris Samuel  wrote:

> On 9/3/20 7:44 am, mike tie wrote:
>
> > Specifically, how is slurmd -C getting that info?  Maybe this is a
> > kernel issue, but other than lscpu and /proc/cpuinfo, I don't know where
> > to look.  Maybe I should be looking at the slurmd source?
>
> It would be worth looking at what something like "lstopo" from the hwloc
> package says about your VM.
>
> All the best,
> Chris
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>


Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread Kirill 'kkm' Katsnelson
On Tue, Mar 10, 2020 at 1:41 PM mike tie  wrote:

> Here is the output of lstopo
>

> *$* lstopo -p
>
> Machine (63GB)
>
>   Package P#0 + L3 (16MB)
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#3
>
>   Package P#1 + L3 (16MB)
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#4
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#5
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#6
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#7
>
>   Package P#2 + L3 (16MB)
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#8
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#9
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#10
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#11
>
>   Package P#3 + L3 (16MB)
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#12
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#13
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#14
>
> L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#15
>

There is no sane way to derive the number 10 from this topology, obviously:
it has a prime factor of 5, but everything in the lstopo output is sized in
powers of 2 (4 packages, a.k.a. sockets, with 4 single-threaded CPU cores each).

I responded yesterday but somehow managed to plop my signature into the
middle of it, so maybe you missed the inline replies?

It's very, very likely that the number is stored *somewhere*. First to
eliminate is the hypothesis that the number is acquired from the control
daemon. That's the simplest step and the largest landgrab in the
divide-and-conquer analysis plan. Then just look where it comes from on the
VM. strace(1) will reveal all files slurmd reads.
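
Concretely, something like this (strace flags may differ slightly across versions):

$ strace -f -e trace=open,openat -o /tmp/slurmd-C.trace slurmd -C
$ grep -v ENOENT /tmp/slurmd-C.trace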

You are not rolling out the VMs from an image, are you? I'm wondering why
you need to tweak an existing VM that is already in a weird state. Is
simply setting its snapshot aside and creating a new one from an image
hard or impossible? I have not touched VMware for more than 10 years, so I may be
a bit naive; on the platform I'm working with now (GCE), the create-use-drop
pattern of VM use is much more common and simpler than creating a VM and
maintaining it either *ad infinitum* or *ad nauseam*, whichever is reached
first. But I don't know anything about VMware; maybe it's not possible
or feasible with it.

 -kkm