I'm working through how to use the new dynamic node features to take down a 
particular node, reconfigure it (using NVIDIA MIG to change the number of GPU 
instances available), and hand it back to Slurm.

I'm at the point where I can take a node out of Slurm's control from the 
master node (scontrol delete nodename....), make the nvidia-smi change, and 
then start slurmd on the node with the changed configuration parameters.  The 
node then shows up again in the sinfo output on the master node with the 
correct new resources.
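
Concretely, the cycle that works for me looks roughly like this (the node 
name, MIG profile IDs, and --conf string are placeholders from my test setup, 
not a recommendation):

  # on the master node: remove the node from Slurm
  scontrol delete nodename=gpu01

  # on the target node: rebuild the MIG layout (profile IDs are an example)
  nvidia-smi mig -dci
  nvidia-smi mig -dgi
  nvidia-smi mig -cgi 9,9 -C

  # on the target node: re-register as a dynamic node with the new resources
  slurmd -Z --conf="Gres=gpu:2"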

What I'm not sure about is: when I want to reconfigure the dynamic node 
AGAIN, how do I do that on the target node?  I can use "scontrol delete" 
again on the scheduler node, but on the dynamic node slurmd will still be 
running.  Currently, for testing purposes, I just find the process ID and 
kill -9 it, then change the node configuration and run 
"slurmd -Z --conf=...." again.
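
So the "reconfigure again" path currently looks like this (same placeholders 
as above):

  # on the master node
  scontrol delete nodename=gpu01

  # on the target node: stop the running slurmd the blunt way
  kill -9 $(pidof slurmd)

  # change the MIG layout, then re-register with the new configuration
  nvidia-smi mig ....
  slurmd -Z --conf=....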

Is there a more elegant way to change the configuration on the dynamic node 
than by killing the existing slurmd process and starting it again?

I'll note that I tried doing everything from the master (slurmctld) node, 
since there is an option of creating the node there with "scontrol create" 
instead of running slurmd on the dynamic node.  But when I tried that, the 
dynamic node I created showed up in the sinfo output with a ~ next to it 
(powered off).  The dynamic nodes documentation page does not mention 
whether, or how, slurmd is supposed to be running on the dynamic node if 
delete and create are handled only on the master node.
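
What I tried there was something along the lines of the example in the 
dynamic nodes docs (all values are placeholders):

  # on the master (slurmctld) node only; no slurmd started on the target yet
  scontrol create NodeName=gpu01 CPUs=16 RealMemory=30000 Gres=gpu:2 State=CLOUD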

Thanks.

Rob
