Re: [slurm-users] how do slurm schedule health check when setting "HealthCheckNodeState=CYCLE"

2020-12-02 Thread Yair Yarom
Hi,

We also noticed this. We eventually set HealthCheckInterval to its
maximum (65535) and created a systemd timer that runs the scripts
outside of Slurm, with proper intervals and randomized delays.

Yair.
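
One way to picture the "spread the checks over the interval" idea is a stable per-node offset. The Python sketch below is only an illustration under assumptions: the function name, the node names, and the CRC-based mapping are hypothetical, and it shows neither what Slurm does internally nor how the timer above is configured (systemd timers have their own RandomizedDelaySec setting for this purpose).

```python
import zlib


def health_check_offset(hostname: str, interval_s: int) -> int:
    """Map a hostname to a stable offset in [0, interval_s) seconds.

    Hypothetical helper: each node would sleep this long before running
    its health check, so checks land at different points of the interval.
    """
    return zlib.crc32(hostname.encode("utf-8")) % interval_s


if __name__ == "__main__":
    interval = 600  # seconds, as in the original question
    for n in range(1, 6):
        host = f"node{n:03d}"  # made-up node names
        print(host, health_check_offset(host, interval))
```

Because the offset depends only on the hostname, a node keeps the same slot from one round to the next, which is a design choice rather than a requirement; a random delay would spread the load just as well.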

On Wed, Dec 2, 2020 at 9:03 AM  wrote:

> Hello,
>
>
>
> Our Slurm cluster manages 600+ nodes, and I tested setting
> HealthCheckNodeState=CYCLE in slurm.conf. According to the slurm.conf
> manual, setting this to CYCLE causes Slurm to “cycle through running on
> all compute nodes through the course of the HealthCheckInterval”. So I set
> “HealthCheckInterval=600” and expected the health checks to be evenly
> distributed across the 600-second period.
>
> But the test showed that the earliest node was checked at about
> 14:19:35 and the latest at about 14:20:39, so one round of health checks
> was spread over only 60+ seconds. The previous round ran from 14:08:10 to
> 14:09:26. It seems that HealthCheckInterval only controls the time
> between two rounds, not the time span over which one round of checks is
> distributed.
>
> So did I misread the description in the manual? And is there any way to
> control how the health checks within one round are spread across nodes?
>
>
>
> Thanks.
>


[slurm-users] job restart :: how to find the reason

2020-12-02 Thread Adrian Sevcenco

Hi! I encountered a situation where a bunch of jobs were restarted,
as seen from Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0

So I would like to know how I can find out why there was a requeue
(when there is only one partition defined) and why there was a restart.

Thanks a lot!!!
Adrian



[slurm-users] Randomize Slurm Node Allocation

2020-12-02 Thread Fabio Moreira
Hi,

I would like to know if Slurm has any configuration option to enable
randomized node allocation. We have 256 nodes in our cluster and the first
nodes are always allocated first. Is there any way to allocate them in a
random order? We have already added the option "LLN=YES" to the partition,
but it did not work.

Thanks in advance,

Fábio MS


Re: [slurm-users] Randomize Slurm Node Allocation

2020-12-02 Thread Adrian Sevcenco

On 12/2/20 1:27 PM, Fabio Moreira wrote:

> Hi,
>
> I would like to know if Slurm has any configuration option to enable
> randomized node allocation. We have 256 nodes in our cluster and the
> first nodes are always allocated first. Is there any way to allocate
> them in a random order? We have already added the option "LLN=YES" to
> the partition, but it did not work.

I had a similar situation, and in my case the culprit was that the nodes
defined in the node configuration file had a Weight set. After I removed the
Weight, LLN started to work.
Also, if you have no other requirements, you might want to make sure that
PriorityType is set to priority/basic.

HTH,
Adrian



Re: [slurm-users] job restart :: how to find the reason

2020-12-02 Thread Paul Edmon
You can dig through the slurmctld log and search for the JobID. That 
should tell you what Slurm was doing at the time.


-Paul Edmon-






[slurm-users] slurm_pam_adapt & configless - set-up

2020-12-02 Thread Heckes, Frank
Hello all,
sorry if this has been asked and/or answered before. I couldn’t find a posting 
related to my problem.

 

I’m using Slurm 20.02.01 with a configless setup for all login and
compute nodes.

I set up the Slurm PAM module on a test node following the instructions at
https://slurm.schedmd.com/pam_slurm_adopt.html.

Unless /etc/slurm/slurm.conf exists (with world-readable permissions on the
file and all directories above it), I see the following error message in
messages or the journal:

error: s_p_parse_file: unable to status file /etc/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec

The effect is that it takes awfully long to log in to the node.

Once I copy slurm.conf to the node (the very thing I tried to avoid by using
the configless setup), SSH login is permitted without delay if the user has a
job running on the node, and denied if the user has no jobs running there.
Does anyone know a workaround (besides ‘mine’) or a solution to fix this? Many
thanks in advance.

Cheers,

-Frank





Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-12-02 Thread Robert Kudyba
>
> been having the same issue with BCM, CentOS 8.2 BCM 9.0 Slurm 20.02.3. It
> seems to have started to occur when I enabled proctrack/cgroup and changed
> select/linear to select/cons_tres.
>
Our slurm.conf has the same setting:
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
EnforcePartLimits=YES

We enabled MPS too. Not sure if that's relevant.


> Are you using cgroup process tracking and have you manipulated the
> cgroup.conf file?
>
Here's what we have in ours:
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
TaskAffinity=no
ConstrainCores=no
ConstrainRAMSpace=no
ConstrainSwapSpace=no
ConstrainDevices=no
ConstrainKmemSpace=yes
AllowedRamSpace=100
AllowedSwapSpace=0
MinKmemSpace=30
MaxKmemPercent=100
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

> Do jobs complete correctly when not cancelled?


Yes they do and canceling doesn't always result in a node draining.

So would this be a Slurm issue or Bright? I'm telling users to add 'sleep
60' as the last line in their sbatch files.


Re: [slurm-users] job restart :: how to find the reason

2020-12-02 Thread Adrian Sevcenco

On 12/2/20 4:18 PM, Paul Edmon wrote:

> You can dig through the slurmctld log and search for the JobID. That should
> tell you what Slurm was doing at the time.

Aha, thanks a lot! Found the culprit:

[2020-12-02T06:45:14.200] error: Nodes issaf-0-1 not responding
[2020-12-02T06:45:28.212] requeue job JobId=29594 due to failure of node issaf-0-1
[2020-12-02T06:45:28.212] Requeuing JobId=29594
...

[2020-12-02T06:45:28.213] error: Nodes issaf-0-1 not responding, setting DOWN
[2020-12-02T06:45:28.248] Node issaf-0-1 now responding
[2020-12-02T06:45:28.248] node_did_resp: node issaf-0-1 returned to service
[2020-12-02T06:45:28.700] _job_complete: JobId=29594 WTERMSIG 15
[2020-12-02T06:45:28.700] _job_complete: JobId=29594 cancelled by interactive user
..

[2020-12-02T06:47:30.304] sched: Allocate JobId=29594 NodeList=issaf-0-1 #CPUs=1 Partition=CLUSTER
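
The search itself is easy to script. The sketch below is a minimal illustration, assuming the controller log has already been read into a list of lines; the function name is hypothetical and the sample lines are copied from the excerpt in this message. Note that lines naming only the node (such as "not responding") do not contain the job ID, so a second search on the node name is still needed to see the cause.

```python
import re


def lines_for_job(log_lines, job_id):
    """Return the log lines that mention JobId=<job_id> exactly."""
    pattern = re.compile(rf"\bJobId={re.escape(str(job_id))}\b")
    return [line for line in log_lines if pattern.search(line)]


# Sample lines copied from the slurmctld log excerpt in this message.
sample = [
    "[2020-12-02T06:45:14.200] error: Nodes issaf-0-1 not responding",
    "[2020-12-02T06:45:28.212] requeue job JobId=29594 due to failure of node issaf-0-1",
    "[2020-12-02T06:45:28.212] Requeuing JobId=29594",
    "[2020-12-02T06:45:28.248] Node issaf-0-1 now responding",
    "[2020-12-02T06:47:30.304] sched: Allocate JobId=29594 NodeList=issaf-0-1 #CPUs=1 Partition=CLUSTER",
]

if __name__ == "__main__":
    for line in lines_for_job(sample, 29594):
        print(line)
```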

The weird thing is that I have continuous monitoring (Ganglia) data, but that
is beyond the scope of this list.

Thanks a lot!
Adrian












[slurm-users] FairShare

2020-12-02 Thread Micheal Krombopulous
Can someone tell me how to calculate fairshare (under fairtree)? I can't figure 
it out. I would have thought it would be the same score for all users in an 
account. E.g., here is one of my accounts:

Account  User       RawShares  NormShares  RawUsage  NormUsage  EffectvUsage     LevelFS  FairShare
-------  ---------  ---------  ----------  --------  ---------  ------------  ----------  ---------
root                             0.000000    611349                 1.000000
 root    root               1    0.076923         0   0.000000      0.000000         inf   1.000000
 sray                       1    0.076923     30921   0.505582      0.505582    0.152147
  sray   phedge             1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   raab               1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   benequist          1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   bosch              1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   rjenkins           1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   esmith             1    0.050000         0   0.000000      0.000000  1.7226e+07   0.054545
  sray   gheinz             1    0.050000         0   0.000000      0.000000  1.9074e+14   0.072727
  sray   jfitz              1    0.050000         0   0.000000      0.000000  8.0640e+20   0.081818
  sray   ajoel              1    0.050000     42449   0.069465      0.137396    0.363913   0.018182
  sray   jmay               1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   aferrier           1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   bdehaven           1    0.050000    225002   0.367771      0.727420    0.068736   0.009091
  sray   msmythe            1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   gfink              1    0.050000         0   0.000000      0.000000  2.0343e+05   0.045455
  sray   ahantau            1    0.050000        31   0.000051      0.000102  491.737549   0.036364
  sray   hmiller            1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   ttinker            1    0.050000         0   0.000000      0.000000  1.4798e+13   0.063636
  sray   wcooper            1    0.050000         0   0.000000      0.000000         inf   0.181818
  sray   xtsao              1    0.050000     41734   0.068296      0.135083    0.370143   0.027273
  sray   xping              1    0.050000         0   0.000000      0.000000  1.9833e+24   0.090909




Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox

Micheal,

Details are at https://slurm.schedmd.com/fair_tree.html. If they have the
same shares and usage as each other, they will have the same fair share
value.  One thing to keep in mind is that sshare rounds or truncates the
values, so 0.00 does not necessarily mean that a value is actually 0.
https://slurm.schedmd.com/SUG14/fair_tree.pdf has more details, starting
at page 34 or so.


Ryan








Re: [slurm-users] FairShare

2020-12-02 Thread Micheal Krombopulous
I've read the manual and I re-read the other link. What they boil down to is
that Fair Share is calculated based on a recondite "rooted plane tree", which
I do not have the background in discrete math to understand.

I'm hoping someone can explain it so my little kernel can understand.





Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox
It's really similar to a binary search tree.  Within each account, it is
Shares / Usage to calculate the Level FS.
https://slurm.schedmd.com/SUG14/fair_tree.pdf has more details, starting
at page 34 or so.  It even has an "animation".
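
As a rough check of the Shares / Usage rule against the sshare output posted at the start of this thread, the Python sketch below divides NormShares by EffectvUsage for a few rows. The function name is hypothetical, and treating a usage of exactly zero as infinite is an assumption made for the illustration; sshare rounds its columns, so a printed 0.00 may hide a small nonzero usage and the printed LevelFS can then be a large finite number instead of inf.

```python
def level_fs(norm_shares: float, effective_usage: float) -> float:
    """Level FS as described above: shares divided by usage among siblings."""
    if effective_usage == 0:
        # Assumption for this sketch: no usage at all means an infinite Level FS.
        return float("inf")
    return norm_shares / effective_usage


# (NormShares, EffectvUsage, LevelFS as printed by sshare), from the posted table.
rows = {
    "sray (account)": (0.076923, 0.505582, 0.152147),
    "ajoel": (0.05, 0.137396, 0.363913),
    "xtsao": (0.05, 0.135083, 0.370143),
    "bdehaven": (0.05, 0.727420, 0.068736),
}

if __name__ == "__main__":
    for name, (shares, usage, printed) in rows.items():
        print(f"{name:15s} computed={level_fs(shares, usage):.6f} printed={printed:.6f}")
```

The computed values agree with the printed ones to within the rounding of the input columns.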


Ryan








Re: [slurm-users] FairShare

2020-12-02 Thread Renfro, Michael
Yesterday, I posted https://docs.rc.fas.harvard.edu/kb/fairshare/ in response
to a similar question. If you want the simplest general explanation for
FairShare values, it's that they range from 0.0 to 1.0, values above 0.5
indicate that account or user has used less than their share of the
resource, and values below 0.5 indicate that that account or user has used
more than their share of the resource.

Since all your users have the same RawShares value and are entitled to the same 
share of the resource, you can see that bdehaven has the most RawUsage and the 
lowest FairShare value, followed by ajoel and xtsao with almost identical 
RawUsage and FairShare, and finally ahantau with very little usage and the 
highest FairShare value of those four.

We use FairShare here as the dominant factor in priorities for queued jobs: if 
you're a light user, we bump up your priority over heavier users, and your job 
starts quicker than those for heavier users, assuming all other job attributes 
are equal.

All these values are relative: in our setup, we'd bump ahantau's pending jobs 
ahead of the others, and put bdehaven's at the end. But if root needed to run a 
job outside the sray account, they'd get an enormous bump ahead since the sray 
account has used far more than its fair share of the resource.


Re: [slurm-users] FairShare

2020-12-02 Thread Erik Bryer
I'm not talking about the Level Fair Share. That's easy to compute. I'm talking 
about Fair Share -- what sshare prints out on the rightmost side.






Re: [slurm-users] FairShare

2020-12-02 Thread Erik Bryer
I read that link. If Fair Share is so rational (low users get high scores, and 
high users get low scores), then why do ajoel's and xtsao's Fair Share scores 
differ this much? Their Level Fair Share scores make more sense.

> sray  ajoel  1  0.050000  42449  0.069465  0.137396  0.363913  0.018182
> sray  xtsao  1  0.050000  41734  0.068296  0.135083  0.370143  0.027273

Which brings me back to my OP: How is Fair Share calculated?


Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox

From https://slurm.schedmd.com/fair_tree.html:
The basic idea is to set rank equal to the count of user associations,
then start at root:

*   Calculate Level Fairshare for the subtree's children
*   Sort children of the subtree
*   Visit the children in descending order:
    -    If user, assign a final fairshare factor similar to
         (rank-- / user_assoc_count)
    -    If account, descend to account
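
Applied to the users of the sray account from the sshare output at the start of this thread, the ranking step can be sketched in Python as follows. This is an illustration under assumptions, not Slurm's code: the function name is hypothetical; user_assoc_count=110 and the starting rank of 20 are inferred from the posted FairShare values (0.009091 is about 1/110 and 0.181818 is about 20/110), not stated anywhere in the thread; and the tie handling (equal Level FS values share a rank while the counter keeps decreasing) is likewise inferred from the table.

```python
import math


def assign_fairshare(level_fs, start_rank, user_assoc_count):
    """Rank sibling users by Level FS, highest first, and return rank / count.

    Users with equal Level FS share the rank of the first user in their
    group, while the running rank still drops by one per user.
    """
    ordered = sorted(level_fs.items(), key=lambda item: item[1], reverse=True)
    factors = {}
    rank = start_rank
    shared_rank = start_rank
    previous_value = None
    for user, value in ordered:
        if value != previous_value:
            shared_rank = rank
        factors[user] = shared_rank / user_assoc_count
        previous_value = value
        rank -= 1
    return factors


INF = math.inf
# LevelFS column for the users of the sray account, copied from the posted table.
sray_level_fs = {
    "phedge": INF, "raab": INF, "benequist": INF, "bosch": INF, "rjenkins": INF,
    "esmith": 1.7226e07, "gheinz": 1.9074e14, "jfitz": 8.0640e20,
    "ajoel": 0.363913, "jmay": INF, "aferrier": INF, "bdehaven": 0.068736,
    "msmythe": INF, "gfink": 2.0343e05, "ahantau": 491.737549, "hmiller": INF,
    "ttinker": 1.4798e13, "wcooper": INF, "xtsao": 0.370143, "xping": 1.9833e24,
}

# Both numbers below are inferred from the table, not given in the thread.
factors = assign_fairshare(sray_level_fs, start_rank=20, user_assoc_count=110)

if __name__ == "__main__":
    for user in ("phedge", "xping", "xtsao", "ajoel", "bdehaven"):
        print(f"{user:10s} {factors[user]:.6f}")
```

Under these assumptions the sketch reproduces the FairShare column for every sray user. It also shows why ajoel and xtsao differ by 0.009091 despite nearly identical usage: the final value is a rank divided by the number of user associations, so neighbouring ranks always differ by 1/110 no matter how close their Level FS values are.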











Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox

That is not for Fair Tree, which is what Micheal asked about.

Ryan

On 12/2/20 10:32 AM, Renfro, Michael wrote:


Yesterday, I posted https://docs.rc.fas.harvard.edu/kb/fairshare/ in
response to a similar question. If you want the simplest general
explanation for FairShare values, it's that they range from 0.0 to
1.0: values above 0.5 indicate that the account or user has used less
than their share of the resource, and values below 0.5 indicate that the
account or user has used more than their share of the resource.


Since all your users have the same RawShares value and are entitled to 
the same share of the resource, you can see that bdehaven has the most 
RawUsage and the lowest FairShare value, followed by ajoel and xtsao 
with almost identical RawUsage and FairShare, and finally ahantau with 
very little usage and the highest FairShare value.


We use FairShare here as the dominant factor in priorities for queued 
jobs: if you're a light user, we bump up your priority over heavier 
users, and your job starts quicker than those for heavier users, 
assuming all other job attributes are equal.


All these values are relative: in our setup, we'd bump ahantau's 
pending jobs ahead of the others, and put bdehaven's at the end. But 
if root needed to run a job outside the sray account, they'd get an 
enormous bump ahead since the sray account has used far more than its 
fair share of the resource.


*From: *slurm-users 
*Date: *Wednesday, December 2, 2020 at 11:23 AM
*To: *slurm-users@lists.schedmd.com 
*Subject: *Re: [slurm-users] FairShare

I've read the manual and I re-read the other link. What they boil down 
to is Fair Share is calculated based on a recondite "rooted plane 
tree", which I do not have the background in discrete math to understand.


I'm hoping someone can explain it so my little kernel can understand.




Re: [slurm-users] FairShare

2020-12-02 Thread Micheal Krombopulous
Yes, that concept of rank tripped me up. The "count of user associations that 
start at root" you mean? Do you mean all associations across all accounts or 
just the account being examined? Then you say "final fairshare factor similar 
to (rank-- / user_assoc_count)". Wouldn't that equal 1? I'm clearly not 
understanding something fundamental.

From: Ryan Cox
Sent: Wednesday, December 2, 2020 10:43 AM
To: Slurm User Community List; Erik Bryer; Micheal Krombopulous
Subject: Re: [slurm-users] FairShare

From https://slurm.schedmd.com/fair_tree.html:
The basic idea is to set rank equal to the count of user associations then
start at root:
*   Calculate Level Fairshare for the subtree's children
*   Sort children of the subtree
*   Visit the children in descending order:
    -   If user, assign a final fairshare factor similar to
        (rank-- / user_assoc_count)
    -   If account, descend to account
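
The steps above can be sketched in Python as follows. This is only an illustration, not Slurm's code: the `Assoc` class, the `fair_tree` function and the toy tree are made up for the example. Tied accounts (which Slurm treats specially) are not handled, and tied users are assumed to share the rank of the first tied user visited while each still uses up one rank.

```python
from dataclasses import dataclass, field


@dataclass
class Assoc:
    """Hypothetical association node: an account (has children) or a user."""
    name: str
    level_fs: float                      # Level FS relative to its siblings
    children: list = field(default_factory=list)
    fairshare: float = 0.0               # filled in for users only

    @property
    def is_user(self):
        # Simplification: a node without children is treated as a user.
        return not self.children


def fair_tree(root, user_count):
    """Walk the tree depth-first, best Level FS first, handing out ranks."""
    rank = user_count                    # rank starts at the total user count

    def visit(account):
        nonlocal rank
        prev_fs = prev_rank = None
        ordered = sorted(account.children, key=lambda c: c.level_fs, reverse=True)
        for child in ordered:
            if child.is_user:
                # Assumed tie rule: reuse the rank of the previous tied sibling.
                shared = prev_rank if child.level_fs == prev_fs else rank
                child.fairshare = shared / user_count
                prev_fs, prev_rank = child.level_fs, shared
                rank -= 1                # "rank--": every user uses up one rank
            else:
                prev_fs = prev_rank = None
                visit(child)

    visit(root)


# Toy tree: two accounts, five users in total.
inf = float("inf")
a = Assoc("A", 2.0, [Assoc("a1", inf), Assoc("a2", inf), Assoc("a3", 1.5)])
b = Assoc("B", 0.5, [Assoc("b1", 3.0), Assoc("b2", 0.2)])
root = Assoc("root", 1.0, [b, a])

fair_tree(root, user_count=5)
result = {u.name: u.fairshare for acct in (a, b) for u in acct.children}
print(result)
```

Because `rank--` is a post-decrement, only the first user visited gets rank / user_assoc_count = 1; every later user gets a smaller value. Note also that b1 has a higher Level FS than a3 but ends up with a lower factor, because account A as a whole outranks account B.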


On 12/2/20 10:34 AM, Erik Bryer wrote:
I'm not talking about the Level Fair Share. That's easy to compute. I'm talking 
about Fair Share -- what sshare prints out on the rightmost side.

From: slurm-users on behalf of Ryan Cox
Sent: Wednesday, December 2, 2020 10:31 AM
To: Slurm User Community List; Micheal Krombopulous
Subject: Re: [slurm-users] FairShare

It's really similar to a binary search tree.  Within each account, the Level FS
is calculated as Shares / Usage.
https://slurm.schedmd.com/SUG14/fair_tree.pdf has more details, starting at
page 34 or so.  It even has an "animation".
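
As a rough check of "Shares / Usage", the sketch below divides the NormShares column by the EffectvUsage column for a few rows of the posted sshare table and compares the result with the printed LevelFS. Reading the formula as NormShares / EffectvUsage is an assumption that happens to fit these rows; `level_fs` is a hypothetical helper, not a Slurm function.

```python
def level_fs(norm_shares, effective_usage):
    """Shares / Usage among siblings; zero usage gives an infinite Level FS."""
    if effective_usage == 0:
        return float("inf")
    return norm_shares / effective_usage


# (NormShares, EffectvUsage, LevelFS as printed by sshare) from the posted table
posted = {
    "sray":     (0.076923, 0.505582, 0.152147),
    "ajoel":    (0.05, 0.137396, 0.363913),
    "bdehaven": (0.05, 0.727420, 0.068736),
    "xtsao":    (0.05, 0.135083, 0.370143),
}

for name, (shares, usage, printed) in posted.items():
    computed = level_fs(shares, usage)
    print(f"{name:9s} computed={computed:.6f} printed={printed:.6f}")
```

Rows with tiny usage (ahantau, for example) do not reproduce as closely, since the printed columns are rounded to six decimals.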

Ryan

On 12/2/20 10:22 AM, Micheal Krombopulous wrote:


Re: [slurm-users] FairShare

2020-12-02 Thread Micheal Krombopulous
You seem to be saying sort the users in my account ((rank-1)/user count)=FS (no 
subaccounts). But that doesn't calculate the FS values I'm seeing. I still see 
no way to calculate ~FS.
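
For what it is worth, the FairShare column in the posted table is consistent with rank / user_assoc_count if the count covers all user associations in the cluster rather than only the 20 in sray. The count of 110 used below is an assumption, inferred only from the lowest value (0.009091 is about 1/110); it does not appear anywhere in the thread.

```python
USER_ASSOC_COUNT = 110   # assumption, inferred from 0.009091 ~= 1/110

# FairShare values as posted, listed from lowest to highest LevelFS
posted = [
    ("bdehaven", 0.009091),
    ("ajoel",    0.018182),
    ("xtsao",    0.027273),
    ("ahantau",  0.036364),
    ("gfink",    0.045455),
    ("esmith",   0.054545),
    ("ttinker",  0.063636),
    ("gheinz",   0.072727),
    ("jfitz",    0.081818),
    ("xping",    0.090909),
    ("phedge",   0.181818),  # one of the ten users tied at LevelFS = inf
]

# Recover the implied integer rank of each user.
ranks = {name: round(fs * USER_ASSOC_COUNT) for name, fs in posted}

for name, fs in posted:
    print(f"{name:9s} posted={fs:.6f} rank={ranks[name]:3d} "
          f"rank/{USER_ASSOC_COUNT}={ranks[name] / USER_ASSOC_COUNT:.6f}")
```

Under that assumption the sray users occupy ranks 1 to 20, which would make sray the lowest-ranked account. The ten users tied at LevelFS = inf all show 20/110 while xping, next in line, shows 10/110, which suggests that tied users share one rank but each still uses up a rank.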


Re: [slurm-users] FairShare

2020-12-02 Thread Paul Edmon

Yup, our doc is for the classic fairshare not for fairtree.

Thanks for the kudos on the doc by the way.  We are glad it is useful.

-Paul Edmon-

On 12/2/2020 12:45 PM, Ryan Cox wrote:

That is not for Fair Tree, which is what Micheal asked about.

Ryan


Re: [slurm-users] slurm_pam_adapt & configless - set-up

2020-12-02 Thread Ole Holm Nielsen

Hi Frank,

You must update Slurm to a more recent version due to a configless bug 
that existed in early versions of 20.02.


/Ole


On 02-12-2020 15:50, Heckes, Frank wrote:

Hello all,
sorry if this has been asked and/or answered before. I couldn’t find a 
posting related to my problem.


I’m using slurm 20.02.01 and use a configless – set-up for all login and 
compute nodes.


I set-up slurm PAM on a test node following the instructions at 
https://slurm.schedmd.com/pam_slurm_adopt.html.


If /etc/slurm/slurm.conf does not exist (with world-readable permissions 
on the file and all its directories), I see the following error message 
in /messages/ or /journal/:


error: s_p_parse_file: unable to status file /etc/slurm/slurm.conf: No 
such file or directory, retrying in 1sec up to 60sec


The effect is that it takes awfully long to log in to the node.

Once I copy slurm.conf to the node (which is exactly what I tried to 
avoid by using the configless set-up), ssh login is permitted without 
delay if a job is running, and denied if no jobs are running for the 
user trying to access the node. Does anyone know a workaround (besides 
‘mine’) or a solution to fix this? Many thanks in advance.