I would like all partitions to have exclusive node access except for a
single "transfer" partition, where we will allow jobs to share the single
node. The partition flag for this appears to be OverSubscribe, but there are
no choices for sharing a node, just sharing a core, CPU, or socket. Looks
like
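For reference, here is roughly what I think the partition definitions would
need to look like in slurm.conf (node and partition names are made up, and
this assumes select/cons_tres so the transfer node gets shared at the core
level):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# whole-node allocations everywhere else
PartitionName=compute Nodes=node[01-10] OverSubscribe=EXCLUSIVE
# transfer node: multiple jobs can land on it, each on its own cores
PartitionName=transfer Nodes=xfer01 OverSubscribe=NO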
I didn't say I felt slurmd could not run as a service on a dynamic node. I'm
just saying that the example they give on their dynamic nodes webpage does not
show slurmd running as a service. So they seem to imply there's some
way, other than running slurmd as a service, that you can
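If I'm remembering that page right, the example there just launches slurmd
by hand with the -Z flag rather than through systemd, something like this
(the node spec is only illustrative):

slurmd -Z --conf "Gres=gpu:2 Feature=dynamic"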
You shouldn't have to change any parameters if you have it configured in
the defaults. Just systemctl stop/start slurmd as needed.
something like:
scontrol update state=drain nodename=<nodename> reason="MIG reconfig"
ssh <nodename> "systemctl stop slurmd"
ssh <nodename> "systemctl start slurmd"
Not sure what would
Ya, we're still working out the mechanism for taking the node out, making the
changes, and bringing it back. But the part I can't figure out is slurmd
running on the remote node. What do I do with it? Do I run it standalone, and
when I need to reconfigure, I kill -9 it and execute it again with the new configuration?
Just off the top of my head here.
I would expect you need to have no jobs currently running on the node,
so you could submit a job to the node that sets the node to drain,
does any local things needed, then exits. As part of the EpilogSlurmctld
script, you could check for drained nodes and finish the reconfiguration from there.
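As a rough, untested sketch of that idea (the reason string, the MIG
helper script, and passwordless ssh from the controller to the node are
all assumptions on my part):

#!/bin/bash
# Hypothetical EpilogSlurmctld sketch: after each job finishes, see whether its
# node was drained for a MIG reconfig and, if it is now idle, reshape and resume it.
node="$SLURM_JOB_NODELIST"                       # assumes a single-node job
if scontrol show node "$node" | grep -q 'MIG reconfig'; then
    if [ -z "$(squeue -h -w "$node")" ]; then    # nothing left running there
        ssh "$node" 'systemctl stop slurmd'
        ssh "$node" '/usr/local/sbin/mig-reshape.sh'   # placeholder for the nvidia-smi mig steps
        ssh "$node" 'systemctl start slurmd'
        scontrol update nodename="$node" state=resume
    fi
fi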
I'm working through how to use the new dynamic node features in order to take
down a particular node, reconfigure it (using NVIDIA MIG to change the number
of GPU instances available), and give it back to slurm.
I'm at the point where I can take a node out of slurm's control from the master
node
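For what it's worth, the MIG reshaping step on the node itself ends up being
something along these lines (the profile IDs are just an example, and it has
to run with nothing using the GPU):

nvidia-smi mig -dci            # destroy the existing compute instances
nvidia-smi mig -dgi            # destroy the existing GPU instances
nvidia-smi mig -cgi 19,19 -C   # create new GPU instances (example profiles) plus their compute instances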