One possibility:
It sounds like your concern is users with interactive jobs started from
the login node that are running under screen/tmux.
That being the case, you need the running jobs to end while keeping
users from starting new tmux sessions.
Definitely do 'scontrol update state=down partition=xxxx' for each
partition. Also:
touch /etc/nologin
That will prevent new logins (root can still get in).
Then send a message to everyone who is still logged in:
wall "system going down at XX:XX, please end your sessions"
Then wait for users to drain off the login node and do your maintenance.
When done, remove the /etc/nologin file and users will be able to log in
again.
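
As a rough sketch, the steps above could be wrapped in a small script. The partition list, the announced time, and the DRY_RUN switch are placeholders I made up for illustration, and the scontrol calls use the documented PartitionName=/State= spelling. Setting the partitions back to UP at the end is implied rather than stated above, so treat it as an assumption too.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the login-node maintenance steps described above.
# Assumptions (not from the thread): PARTITIONS, DOWN_AT and DRY_RUN are
# illustrative placeholders; adjust them for your own cluster.

DRY_RUN="${DRY_RUN:-1}"           # 1 = only print the commands, 0 = run them
PARTITIONS="${PARTITIONS:-xxxx}"  # space-separated partition names
DOWN_AT="${DOWN_AT:-XX:XX}"       # time announced to users

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

begin_maintenance() {
    for p in $PARTITIONS; do
        # Queued jobs stop being started; running jobs keep going.
        run scontrol update PartitionName="$p" State=DOWN
    done
    run touch /etc/nologin   # blocks new non-root logins
    run wall "system going down at $DOWN_AT, please end your sessions"
}

end_maintenance() {
    run rm -f /etc/nologin   # allow logins again
    for p in $PARTITIONS; do
        # Assumption: partitions are returned to service here.
        run scontrol update PartitionName="$p" State=UP
    done
}
```

With the default DRY_RUN=1, calling begin_maintenance or end_maintenance only prints the commands it would run. Set DRY_RUN=0 and run as root once the output looks right.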
Brian Andrus
On 1/31/2022 9:18 PM, Sid Young wrote:
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/
On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel <ch...@csamuel.org>
wrote:
On 1/31/22 4:41 pm, Sid Young wrote:
> I need to replace a faulty DIMM chip in our login node so I need to stop
> new jobs being kicked off while letting the old ones end.
>
> I thought I would just set all nodes to drain to stop new jobs from
> being kicked off...
That would basically be the way, but is there any reason why compute
jobs shouldn't start whilst the login node is down?
My concern was to keep the running jobs going and stop new jobs, so
that when the last running job ends I could reboot the login node,
knowing that any terminal "screen"/"tmux" sessions would effectively
have ended because the job(s) had ended.
I'm not sure whether there is an accepted procedure or best practice
for shutting down the login node in this use case.
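
For what it's worth, one rough way to tell when the last running job has ended is to count the RUNNING jobs that squeue reports. This is only a sketch: SQUEUE_CMD is a hypothetical override added so the function can be tried without a live cluster, and it counts every user's running jobs, not just those started from the login node.

```shell
#!/usr/bin/env bash
# Sketch: count jobs still running, to decide when the login node can go down.
# Assumption: SQUEUE_CMD is an illustrative override; by default the real
# squeue is called.

count_running_jobs() {
    # --noheader drops the column titles, --states=RUNNING keeps only
    # running jobs, and %i prints one job ID per line.
    "${SQUEUE_CMD:-squeue}" --noheader --states=RUNNING --format=%i |
        awk 'NF { n++ } END { print n + 0 }'
}
```

Something like 'while [ "$(count_running_jobs)" -gt 0 ]; do sleep 60; done' would then block until the count reaches zero.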
On the bright side I am down to two jobs left so any day now :)
Sid
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA