Re: [slurm-users] Suspending jobs for file system maintenance

Paul Edmon Mon, 25 Oct 2021 06:00:54 -0700

I think it depends on the filesystem type. Lustre generally fails over nicely and handles reconnections without much of a problem. We've done this before without any hitches, even with the jobs being live. Generally the jobs just hang and then resolve once the filesystem comes back. On a live system you will end up with a completion storm: jobs are always exiting, so while the filesystem is gone the jobs that depend on it will just hang, and any that are completing will stall on the completion step. Once it returns, all that traffic flushes. This can create issues where a bunch of nodes get closed due to "Kill task fail" or other completion flags. Generally these are harmless, though I have seen stuck processes on nodes and have had to reboot them to clear, so you should check any node before putting it back in action.

That said, if you are pausing all the jobs and scheduling, this is somewhat mitigated, though jobs will still exit due to timeout.
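
The node check mentioned above could be sketched along these lines. The sinfo format string and the exact reason text ("Kill task failed") are assumptions about a typical setup, and the function name is made up for illustration, so adjust both to what your site actually reports.

```shell
# Read "node|reason" lines on stdin and print the nodes whose drain reason
# mentions a failed task kill. The reason text is an assumption; adjust it.
nodes_with_kill_task_fail() {
    awk -F'|' 'tolower($2) ~ /kill task fail/ { print $1 }'
}

# Assumed usage on a real cluster:
#   sinfo -h -N -R -o '%N|%E' | nodes_with_kill_task_fail | sort -u
# After inspecting a node for stuck processes, return it to service with:
#   scontrol update NodeName=<node> State=RESUME
```

Keeping the filter separate from the sinfo call makes it easy to try on saved output before touching a live system.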


-Paul Edmon-

On 10/25/2021 4:47 AM, Alan Orth wrote:

Dear Jurgen and Paul,

This is an interesting strategy, thanks for sharing. So if I read the scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all job processes. The processes remain in memory, but are paused. What happens to open file handles, since the underlying filesystem goes away and comes back?
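
For the signal side of that question, here is a small local experiment. It is only a sketch on a temporary file, not on Lustre, and it says nothing about what happens when the filesystem itself disappears. It just shows that a process stopped with SIGSTOP keeps its open file descriptor and carries on writing through it after SIGCONT. The file, the helper name and the timings are arbitrary.

```shell
# Demo on a local temp file: a stopped process keeps its file descriptor.
tmp=$(mktemp)
count_lines() { wc -l < "$1" | tr -d ' '; }

# Background writer that holds fd 3 open and appends a line every 0.1 s.
(
    exec 3>"$tmp"
    i=0
    while [ "$i" -lt 100 ]; do
        echo "tick $i" >&3
        i=$((i + 1))
        sleep 0.1
    done
) &
pid=$!

sleep 0.5
kill -STOP "$pid"               # pause the writer
sleep 0.2
before=$(count_lines "$tmp")
sleep 0.5
after=$(count_lines "$tmp")     # unchanged while the writer is stopped

kill -CONT "$pid"               # let it continue
sleep 1
resumed=$(count_lines "$tmp")   # grows again, written through the same fd

echo "stopped: $before -> $after lines, after SIGCONT: $resumed lines"
kill "$pid" 2>/dev/null || true
wait "$pid" 2>/dev/null || true
rm -f "$tmp"
```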


Thank you,

On Sat, Oct 23, 2021 at 1:10 AM Juergen Salk <juergen.s...@uni-ulm.de> wrote:


    Thanks, Paul, for confirming our planned approach. We did it that way
    and it worked very well. I have to admit that my fingers were a bit
    wet when suspending thousands of running jobs, but it worked without
    any problems. I just didn't dare to resume all suspended jobs at
    once, but did that in a staggered manner.
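
    A staggered resume like the one described above might look roughly like
    this. It is only a sketch: the batch size, the pause, and the function
    name resume_staggered are made-up illustration values, not what was
    actually run.

    ```shell
    # Hypothetical defaults; tune for your own site.
    BATCH_SIZE=${BATCH_SIZE:-200}
    PAUSE_SECONDS=${PAUSE_SECONDS:-30}

    # Read job IDs from stdin and resume them, pausing after every batch.
    resume_staggered() {
        count=0
        while read -r jobid; do
            [ -n "$jobid" ] || continue
            scontrol resume "$jobid"
            count=$((count + 1))
            if [ $((count % BATCH_SIZE)) -eq 0 ]; then
                sleep "$PAUSE_SECONDS"
            fi
        done
    }

    # Usage: squeue -ho %A -t S | resume_staggered
    ```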

    Best regards
    Jürgen

    * Paul Edmon <ped...@cfa.harvard.edu> [211019 15:15]:
    > Yup, we follow the same process for when we do Slurm upgrades, this looks
    > analogous to our process.
    >
    > -Paul Edmon-
    >
    > On 10/19/2021 3:06 PM, Juergen Salk wrote:
    > > Dear all,
    > >
    > > we are planning to perform some maintenance work on our Lustre file system
    > > which may or may not harm running jobs. Although failover functionality is
    > > enabled on the Lustre servers we'd like to minimize risk for running jobs
    > > in case something goes wrong.
    > >
    > > Therefore, we thought about suspending all running jobs and resume
    > > them as soon as file systems are back again.
    > >
    > > The idea would be to stop Slurm from scheduling new jobs as a first step:
    > >
    > > # for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done
    > >
    > > with foo, bar and baz being the configured partitions.
    > >
    > > Then suspend all running jobs (taking job arrays into account):
    > >
    > > # squeue -ho %A -t R | xargs -n 1 scontrol suspend
    > >
    > > Then perform the failover of OSTs to another OSS server.
    > > Once done, verify that file system is fully back and all
    > > OSTs are in place again on the client nodes.
    > >
    > > Then resume all suspended jobs:
    > >
    > > # squeue -ho %A -t S | xargs -n 1 scontrol resume
    > >
    > > Finally bring back the partitions:
    > >
    > > # for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done
    > >
    > > Does that make sense? Is that common practice? Are there any caveats that
    > > we must think about?
    > >
    > > Thank you in advance for your thoughts.
    > >
    > > Best regards
    > > Jürgen
    > >



--
Alan Orth
alan.o...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
