From what I know of how this works, no, it's not getting it from a local file or from the master node. I don't believe slurmd -C even makes a network connection, nor does it require a slurm.conf in order to run. If you can run it fresh on a node with no config and that's what it comes up with, it's probably getting the number from the VM itself somehow.
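One quick way to sanity-check that (nothing Slurm-specific here beyond slurmd -C itself; the lscpu grep is just one way to trim the output): run these on the resized VM and see whether the counts agree with what the kernel reports.

$ slurmd -C     # prints the CPUs/Boards/Sockets/Cores/Threads line slurmd detects
$ lscpu | grep -E '^(CPU\(s\)|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core)'
$ nproc         # CPUs visible to the kernel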
--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Mar 11, 2020, at 10:26 AM, mike tie <m...@carleton.edu> wrote:
>
> Yep, slurmd -C is obviously getting the data from somewhere, either a local
> file or from the master node. Hence my email to the group; I was hoping
> that someone would just say: "yeah, modify file xxxx". But oh well. I'll
> start playing with strace and gdb later this week; looking through the
> source might also be helpful.
>
> I'm not cloning existing virtual machines with Slurm. I have access to a
> VMware system that from time to time isn't running at full capacity; usage
> is stable for blocks of a month or two at a time, so my thought/plan was to
> spin up a Slurm compute node on it and resize it appropriately every few
> months (why not put it to work?). I started with 10 cores, and it looks like
> I can up it to 16 cores for a while, and that's when I ran into the problem.
>
> -mike
>
>
> Michael Tie
> Technical Director
> Mathematics, Statistics, and Computer Science
>
> One North College Street    phn: 507-222-4067
> Northfield, MN 55057        cel: 952-212-8933
> m...@carleton.edu            fax: 507-222-4312
>
>
> On Wed, Mar 11, 2020 at 1:15 AM Kirill 'kkm' Katsnelson <k...@pobox.com> wrote:
>
> On Tue, Mar 10, 2020 at 1:41 PM mike tie <m...@carleton.edu> wrote:
>
> Here is the output of lstopo:
>
> $ lstopo -p
> Machine (63GB)
>   Package P#0 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#3
>   Package P#1 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#4
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#5
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#6
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#7
>   Package P#2 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#8
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#9
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#10
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#11
>   Package P#3 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#12
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#13
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#14
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#15
>
> There is no sane way to derive the number 10 from this topology. Obviously:
> it has a prime factor of 5, but everything in the lstopo output is sized in
> powers of 2 (4 packages, a.k.a. sockets, with 4 single-threaded CPU cores
> each).
>
> I responded yesterday but somehow managed to plop my signature into the
> middle of it, so maybe you missed the inline replies?
>
> It's very, very likely that the number is stored *somewhere*. First to
> eliminate is the hypothesis that the number is acquired from the control
> daemon. That's the simplest step and the largest land grab in the
> divide-and-conquer analysis plan. Then just look where it comes from on the
> VM; strace(1) will reveal all the files slurmd reads.
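>
> Something along these lines should do it, run on the VM itself (the trace
> file name and the grep filters are only a convenience, pick your own):
>
> $ strace -f -e trace=open,openat -o /tmp/slurmd-C.trace slurmd -C
> $ grep -v ENOENT /tmp/slurmd-C.trace | grep -v '\.so'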
>
> You are not rolling out the VMs from an image, are you? I'm wondering why
> you need to tweak an existing VM that is already in a weird state. Is
> simply setting its snapshot aside and creating a new one from an image
> hard/impossible? I haven't touched VMware in more than 10 years, so I may
> be a bit naive; on the platform I'm working with now (GCE), the
> create-use-drop pattern of VM use is much more common and simpler than
> creating a VM and maintaining it *ad infinitum* or *ad nauseam*, whichever
> is reached first. But I don't know anything about VMware; maybe that's not
> possible or feasible with it.
>
> -kkm
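For what it's worth, the create-use-drop pattern kkm describes looks roughly like this on GCE (the instance name, machine type, and image below are only placeholders): create a node sized for the current capacity, use it for the month or two, then delete it instead of resizing it in place.

$ gcloud compute instances create slurm-node-01 \
    --machine-type=n1-standard-16 \
    --image-family=debian-11 --image-project=debian-cloud
# ... use it as a compute node while the capacity is available ...
$ gcloud compute instances delete slurm-node-01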