I didn't say I felt slurmd could not run as a service on a dynamic node.  I'm 
just saying that the example they give on their dynamic nodes webpage does not 
show slurmd running as a service.  So they seem to imply there's some other 
way, besides running slurmd as a service, to create a dynamic node with a 
different configuration.  In their example, they just execute slurmd with 
parameters on the command line.  So...not as a service.
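
For example, the kind of invocation they show is along these lines (the 
--conf contents here are made up, not their exact example):

slurmd -Z --conf="Feature=dynamic Gres=gpu:2"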

I'm fine with the concept of stopping the service, changing the service 
parameters for the new configuration of the node, and then starting the service 
again.  That's fine, and that makes sense.  What I'm trying to say is that 
their documentation does not demonstrate that way of handling dynamic nodes.  
So I'm trying to figure out what they meant to have happen to a dynamic node 
where slurmd is already running as a process and not as a service.   Is there 
SOME OTHER WAY they expected that a dynamic node could reconfigure itself other 
than through stopping/starting a service?

I think their limited documentation on dynamic nodes basically only covers 
creating a node ONCE and removing it ONCE, and not a scenario where you might 
reconfigure a single node multiple times in its life.   Given that, and having 
the service method of making it work, I'll just go with that.  Thanks for the help.

Rob

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Brian 
Andrus <toomuc...@gmail.com>
Sent: Friday, September 23, 2022 12:24 PM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] slurmd and dynamic nodes


You shouldn't have to change any parameters if you have it configured in the 
defaults. Just systemctl stop/start slurmd as needed.


something like:

scontrol update state=drain nodename=<node_to_change> reason="MIG reconfig"

<wait for it to be drained>

ssh <node_to_change> "systemctl stop slurmd"

<run reconfig stuff>

ssh <node_to_change> "systemctl start slurmd"


Not sure what would make you feel slurmd cannot run as a service on a dynamic 
node. As long as you added the options to the systemd defaults file for it, you 
should be fine (usually /etc/default/slurmd).
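
For example (assuming the packaged slurmd unit file picks up $SLURMD_OPTIONS 
from that defaults file; the option values here are just an illustration):

# /etc/default/slurmd  (or /etc/sysconfig/slurmd on RHEL-type systems)
SLURMD_OPTIONS="-Z --conf=Gres=gpu:2"

Then a systemctl restart of slurmd picks up the new options.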


Brian


On 9/23/2022 7:40 AM, Groner, Rob wrote:
Ya, we're still working out the mechanism for taking the node out, making the 
changes, and bringing it back. But the part I can't figure out is slurmd 
running on the remote node.  What do I do with it?  Do I run it standalone, and 
when I need to reconfigure, I kill -9 it and execute it again with the new 
configuration?  Or what if slurmd is running as a service (as it does on all 
our non-dynamic nodes)?  Do I stop it, change its service parameters, and then 
restart it to reconfigure the node?  The Slurm docs on dynamic nodes don't 
give any indication of how you're supposed to handle slurmd running on the 
dynamic node.  What 
is the preferred method?

Rob

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Brian 
Andrus <toomuc...@gmail.com>
Sent: Friday, September 23, 2022 10:24 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] slurmd and dynamic nodes





Just off the top of my head here.

I would expect you need to have no jobs currently running on the node, so you 
could submit a job to the node that sets the node to drain, does any 
local things needed, and then exits. As part of the EpilogSlurmctld script, you 
could check for drained nodes based on some reason (like 'MIG reconfig') and do 
the head node steps there, with a final bit of bringing it back online.
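
A rough sketch of that Epilog-side check (the reason string and the helper 
script name are made up):

#!/bin/bash
# hypothetical EpilogSlurmctld fragment: after each job finishes, look for
# nodes drained with our marker reason and hand them to a reconfig helper
sinfo -h -N -t drained -o "%N %E" | while read -r node reason; do
    case "$reason" in
        "MIG reconfig"*) /usr/local/sbin/reconfig-node.sh "$node" & ;;
    esac
done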


Or just do all those steps from a script outside slurm itself, on the head 
node. You can use ssh/pdsh to connect to a node and execute things there while 
it is out of the mix.
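
Something like this, roughly, if driving it all from the head node (node name 
and the reconfig script are placeholders):

#!/bin/bash
# drain one node, wait for its jobs to finish, reconfigure it, bring it back
node=gpu01
scontrol update nodename=$node state=drain reason="MIG reconfig"
# wait until no jobs are left on the node
while squeue -h -w $node | grep -q .; do sleep 30; done
ssh $node "systemctl stop slurmd"
ssh $node "/usr/local/sbin/mig-reconfig.sh"   # your local reconfig steps
ssh $node "systemctl start slurmd"
scontrol update nodename=$node state=resume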


Brian Andrus


On 9/23/2022 7:09 AM, Groner, Rob wrote:

I'm working through how to use the new dynamic node features in order to take 
down a particular node, reconfigure it (using NVIDIA MIG to change the number 
of graphics cores available), and give it back to slurm.

I'm at the point where I can take a node out of slurm's control from the master 
node (scontrol delete nodename....), make the nvidia-smi change, and then 
execute slurmd on the node with the changed configuration parameters.  It then 
does show up again in the sinfo output on the master node, with the correct new 
resources.
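
Concretely, the cycle I'm doing is along these lines (node name and --conf 
contents are just examples):

# on the slurmctld host:
scontrol delete nodename=gpu01

# on the node itself, after the nvidia-smi MIG change:
slurmd -Z --conf="Gres=gpu:4"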

What I'm not sure about is...when I want to reconfigure the dynamic node AGAIN, 
how do I do that on the target node?  I can use "scontrol delete" again on the 
scheduler node, but on the dynamic node, slurmd will still be running.  
Currently, for testing purposes, I just find the process ID and kill -9 it.  
Then I change the node configuration and execute "slurmd -Z --conf=...." again.

Is there a more elegant way to change the configuration on the dynamic node 
than by killing the existing slurmd process and starting it again?

I'll note that I tried doing everything from the master (slurmctld) node, since 
there is an option of creating the node there with "scontrol create" instead of 
using slurmd on the dynamic node.  But when I tried that, the dynamic node I 
created showed up in sinfo output with a ~ next to it (powered off).  The 
dynamic node docs page online did not mention how, if at all, slurmd was 
supposed to be running on the dynamic node when handling delete and create 
only on the master node.
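
For reference, what I tried on the controller was along these lines 
(parameters made up; State=CLOUD is what their docs example uses, which may be 
why the node shows the ~/powered-off marker in sinfo):

scontrol create NodeName=gpu01 CPUs=16 RealMemory=64000 State=CLOUD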

Thanks.

Rob
