[slurm-users] Job not cancelled after "TimeLimit" exceeded

2020-03-10 Thread sysadmin.caos
Hi, my SLURM cluster has a partition configured with a "TimeLimit" of 8 hours. A job has now been running for 9h30m and has not been cancelled. During these nine and a half hours, a script has executed "scontrol update partition=mypartition state=down" for disabling this partition (educationa…
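For reference, a partition time limit like the one described is normally set via MaxTime in slurm.conf; a minimal sketch, with placeholder node and partition names:

    # slurm.conf (sketch; node list is hypothetical)
    PartitionName=mypartition Nodes=node[01-10] MaxTime=08:00:00 State=UP

    # Disable the partition, as the script in the thread does:
    $ scontrol update partition=mypartition state=down

Setting a partition down stops new jobs from starting there, but by itself it does not kill jobs that are already running.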

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread Kirill 'kkm' Katsnelson
Yes, it's odd. -kkm On Mon, Mar 9, 2020 at 7:44 AM mike tie wrote: > Interesting. I'm still confused by where slurmd -C is getting the > data. When I think of where the kernel stores info about the processor, I > normally think of /proc/cpuinfo. (By the way, I am running CentOS 7 in…
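A quick way to compare the different sources of CPU topology on a node (assuming standard Linux tools are installed):

    $ slurmd -C                                 # hardware as slurmd autodetects it
    $ lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'   # kernel's view of the topology
    $ grep -c ^processor /proc/cpuinfo          # logical CPU count from the kernel

If slurmd -C disagrees with lscpu, one possible culprit is a virtualized node whose virtual topology changed after the node definition was written.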

Re: [slurm-users] Job not cancelled after "TimeLimit" exceeded

2020-03-10 Thread Ole Holm Nielsen
On 3/10/20 9:03 AM, sysadmin.caos wrote: my SLURM cluster has a partition configured with a "TimeLimit" of 8 hours. A job has now been running for 9h30m and has not been cancelled. During these nine and a half hours, a script has executed "scontrol update partition=mypartition state=down" for d…

Re: [slurm-users] update node config while jobs are running

2020-03-10 Thread Andy Georges
Hi, On Tue, Mar 10, 2020 at 05:49:07AM, Rundall, Jacob D wrote: > I need to update the configuration for the nodes in a cluster and I’d like to > let jobs keep running while I do so. Specifically I need to add > RealMemory= to the node definitions (NodeName=). Is it safe to do this > for…
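The kind of change being discussed would look like this in slurm.conf (a sketch with hypothetical node names and values; RealMemory is in megabytes):

    NodeName=node[01-10] CPUs=16 RealMemory=64000 State=UNKNOWN

    # Push the updated configuration to the daemons:
    $ scontrol reconfigure

Note that scontrol reconfigure rereads slurm.conf, but some node-definition changes have historically required restarting slurmctld and the slurmd daemons, so it is worth testing on an idle node first.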

Re: [slurm-users] Job not cancelled after "TimeLimit" exceeded

2020-03-10 Thread Gestió Servidors
Hello, I have checked my configuration with "scontrol show config" and these are the values of those three parameters: AccountingStorageEnforce = none, EnforcePartLimits = NO, OverTimeLimit = 500 min ...so now I understand why my job hasn't been cancelled after 8 hours... because th…
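The arithmetic behind that conclusion: OverTimeLimit is a grace period added on top of a job's time limit before Slurm cancels it, so with an 8-hour MaxTime plus 500 minutes (8h20m) of grace, the job would not be killed until roughly 16h20m, and 9h30m is well inside that window. The relevant values can be checked with:

    $ scontrol show config | grep -E 'OverTimeLimit|EnforcePartLimits|AccountingStorageEnforce'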

Re: [slurm-users] srun --reboot option is not working

2020-03-10 Thread Brian Andrus
I built/ran a quick test on an older Slurm and do see the issue. Looks like a possible bug. I would open a bug with SchedMD. I couldn't think of a good workaround, since the job would get rescheduled to a different node if you reboot, even if you have the node update its own status at boot. It…

[slurm-users] Diminishing the priority of an account

2020-03-10 Thread Jason Macklin
Hi, we are trying to set up accounts by user group, and I have one group whose priority I'd like to drop from the default of 1 (FairShare). I'm assuming this is accomplished with the sacctmgr command, but I haven't been able to figure out the exact syntax. Assuming this is the correct me…
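The syntax being asked about is most likely sacctmgr's modify with a fairshare value (the account name below is hypothetical):

    # Set the account's relative share:
    $ sacctmgr modify account where name=lowprio set fairshare=1

    # Verify the resulting share tree:
    $ sshare -a

Fairshare values are relative weights, so to deprioritize one group that already sits at the default of 1, the other accounts need larger shares than it has.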

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread mike tie
Here is the output of lstopo:

$ lstopo -p
Machine (63GB)
  Package P#0 + L3 (16MB)
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
    L2 (4096KB) + L1d…

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-10 Thread Kirill 'kkm' Katsnelson
On Tue, Mar 10, 2020 at 1:41 PM mike tie wrote:
> Here is the output of lstopo:
>
> $ lstopo -p
> Machine (63GB)
>   Package P#0 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
>     L2 (4096KB…