Hi Alan and Paul,
I can't claim to be a Lustre guru, but my understanding is that Lustre
failover does not imply umount/mount of the file system on the client
side. On the client side the OSTs just stall until they are back. So
open file handles should actually be kept during that process.
However, I think it depends on the filesystem type. Lustre generally fails over
nicely and handles reconnections without much of a problem. We've done
this before without any hitches, even with the jobs being live.
Generally the jobs just hang and then resolve once the filesystem comes
back. On a
Dear Jurgen and Paul,
This is an interesting strategy, thanks for sharing. So if I read the
scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all job
processes. The processes remain in memory, but are paused. What happens to
open file handles, since the underlying filesystem goes
Thanks, Paul, for confirming our planned approach. We did it that way
and it worked very well. I have to admit that my palms were a bit
sweaty when suspending thousands of running jobs, but it worked without
any problems. I just didn't dare to resume all suspended jobs at
once, but did that in a sta
Yup, we follow the same process when we do Slurm upgrades; this
looks analogous to our process.
-Paul Edmon-
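
For readers who want to try the suspend-then-staggered-resume approach discussed in this thread, here is a minimal Python sketch. It is not the script anyone in the thread actually used. The batch size, the pause between batches, and the helper names (chunks, job_ids_in_state, resume_staggered) are assumptions made for illustration. It relies only on `squeue -h -t <state> -o %i` to list job IDs and on `scontrol suspend` / `scontrol resume` with a comma-separated job list, which normally require administrator privileges.

```python
import subprocess
import time


def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def job_ids_in_state(state):
    """Return the IDs of all jobs in the given state, e.g. RUNNING or SUSPENDED."""
    result = subprocess.run(
        ["squeue", "-h", "-t", state, "-o", "%i"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def resume_staggered(job_ids, batch_size=100, pause_s=30,
                     run=subprocess.run, sleep=time.sleep):
    """Resume jobs batch by batch, pausing between batches.

    batch_size and pause_s are hypothetical defaults; tune them for your site.
    `run` and `sleep` are injectable so the logic can be dry-run tested.
    """
    for n, batch in enumerate(chunks(job_ids, batch_size)):
        if n > 0:
            sleep(pause_s)  # let the file system settle before the next batch
        run(["scontrol", "resume", ",".join(batch)], check=True)


# Example (commented out so that nothing runs by accident):
# before maintenance:
#   for batch in chunks(job_ids_in_state("RUNNING"), 100):
#       subprocess.run(["scontrol", "suspend", ",".join(batch)], check=True)
# after maintenance:
#   resume_staggered(job_ids_in_state("SUSPENDED"))
```

Passing a stand-in for `run` (for example, one that only prints the command) gives a dry run that shows which jobs would be resumed in which batch before touching the live system.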
On 10/19/2021 3:06 PM, Juergen Salk wrote:
Dear all,
we are planning to perform some maintenance work on our Lustre file system
which may or may not harm running jobs. Although failover functionality is
enabled on the Lustre servers, we'd like to minimize the risk for running jobs
in case something goes wrong.
Therefore, we thought about s