[slurm-users] How to share a node

2022-09-23 Thread Borchert, Christopher B ERDC-RDE-ITL-MS CIV
I would like all partitions to have exclusive node access except for a single "transfer" partition, where we will allow jobs to share its single node. The partition flag for this appears to be OverSubscribe. But there are no choices for sharing a node, just sharing a core, CPU, or socket. Looks like
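For what it's worth, a minimal slurm.conf sketch of that layout might look like the lines below. The partition and node names are made up: the normal partitions set OverSubscribe=EXCLUSIVE for whole-node allocation, while the transfer partition leaves OverSubscribe at its default, which with a consumable-resource select plugin already lets several jobs land on the same node as long as they use different cores.

  # hypothetical slurm.conf excerpt
  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core_Memory
  # whole-node allocation on the normal partitions
  PartitionName=compute Nodes=node[001-100] OverSubscribe=EXCLUSIVE
  # transfer partition: default OverSubscribe=NO still allows multiple
  # jobs on the one node, each confined to its own cores
  PartitionName=transfer Nodes=xfer01 OverSubscribe=NO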

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Groner, Rob
I didn't say I felt slurmd could not run as a service on a dynamic node. I'm just saying that the example they give on their dynamic nodes webpage does not show slurmd running as a service. So they seem to imply there's a different way, other than with slurmd running as a service, that you can

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Brian Andrus
You shouldn't have to change any parameters if you have it configured in the defaults. Just systemctl stop/start slurmd as needed. Something like:
  scontrol update state=drain nodename= reason="MIG reconfig"
  ssh  "systemctl stop slurmd"
  ssh  "systemctl start slurmd"
Not sure what would
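Spelled out as a small script, that sequence might look roughly like this; the node name is a placeholder, and the wait-for-jobs and resume steps are assumptions about how the thread's workflow would be completed:

  #!/bin/bash
  NODE=gpu01    # placeholder node name
  scontrol update NodeName=$NODE State=DRAIN Reason="MIG reconfig"
  # wait here for running jobs on the node to finish
  ssh $NODE "systemctl stop slurmd"
  # ... repartition MIG on the node ...
  ssh $NODE "systemctl start slurmd"
  scontrol update NodeName=$NODE State=RESUME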

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Groner, Rob
Yeah, we're still working out the mechanism for taking the node out, making the changes, and bringing it back. But the part I can't figure out is slurmd running on the remote node. What do I do with it? Do I run it standalone, and when I need to reconfigure, I kill -9 it and execute it again with
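If the node registers as a dynamic node, one gentler variant than kill -9 is to stop slurmd cleanly, repartition MIG, and start it again with an updated --conf string (slurmd -Z and --conf are the dynamic-node registration options from the webpage mentioned above). A rough sketch, with made-up Gres counts; whether the controller accepts the changed resources on re-registration is exactly the open question here:

  # on the GPU node, initial dynamic registration
  slurmd -Z --conf "Gres=gpu:7"
  # ...later: stop slurmd, repartition MIG, then re-register...
  slurmd -Z --conf "Gres=gpu:3"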

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Brian Andrus
Just off the top of my head here. I would expect you need to have no jobs currently running on the node, so you could submit a job to the node that sets the node to drain, does any local things needed, then exits. As part of the EpilogSlurmctld script, you could check for drained nodes
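A sketch of that EpilogSlurmctld check, assuming single-node jobs and a hypothetical mig_reconfig.sh living on the node:

  #!/bin/bash
  # EpilogSlurmctld sketch: if the job's node has been drained for a MIG
  # change and nothing else is running there, reconfigure and resume it.
  node="$SLURM_JOB_NODELIST"               # single node assumed
  state=$(sinfo -h -n "$node" -o %T)
  running=$(squeue -h -w "$node" | wc -l)
  if [ "$state" = "drained" ] && [ "$running" -eq 0 ]; then
      ssh "$node" "/usr/local/sbin/mig_reconfig.sh"   # hypothetical script
      scontrol update NodeName="$node" State=RESUME
  fi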

[slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Groner, Rob
I'm working through how to use the new dynamic node features in order to take down a particular node, reconfigure it (using NVIDIA MIG to change the number of graphic cores available) and give it back to Slurm. I'm at the point where I can take a node out of Slurm's control from the master node
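For reference, the controller-side half of that cycle can be done with scontrol delete/create for dynamic nodes. A rough sketch with a made-up node name and specs; the Gres value is an assumption, since in practice the GPU layout would also have to match what the node's own slurmd reports when it comes back:

  # remove the node from Slurm while it is being reconfigured
  scontrol delete NodeName=gpu01
  # ...repartition MIG on gpu01...
  # re-create it with the new resource layout
  scontrol create NodeName=gpu01 CPUs=64 RealMemory=256000 Gres=gpu:3 State=CLOUD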