Hi,
I have configured Slurm cloud scheduling for OpenStack. I am using CentOS 7
with Slurm version 20.11.8 installed from the EPEL RPMs, and it's working
fine, but I am getting some strange errors in the slurmctld (master) logs
which I think indicate a bug.
I am using these options in slurm.conf:
SlurmctldParamet
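(The message is cut off above. For readers following along, a representative cloud-scheduling fragment of slurm.conf might look like the sketch below; the parameter values and helper-script paths are illustrative only, not the poster's actual configuration.)

```
# Illustrative cloud-scheduling fragment -- not the poster's real settings
SlurmctldParameters=cloud_dns,idle_on_node_suspend
SuspendProgram=/usr/local/sbin/openstack_suspend.sh   # hypothetical helper script
ResumeProgram=/usr/local/sbin/openstack_resume.sh     # hypothetical helper script
SuspendTime=600
NodeName=cloud[001-010] State=CLOUD
```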
Hi Alan and Paul,
I can't claim to be a Lustre guru, but my understanding is that Lustre
failover does not imply umount/mount of the file system on the client
side. On the client side the OSTs just stall until they are back. So
open file handles should actually be kept during that process.
However
I think it depends on the filesystem type. Lustre generally fails over
nicely and handles reconnections without much of a problem. We've done
this before without any hitches, even with the jobs being live.
Generally the jobs just hang and then resolve once the filesystem comes
back. On a
Dear Jurgen and Paul,
This is an interesting strategy, thanks for sharing. So if I read the
scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all job
processes. The processes remain in memory, but are paused. What happens to
open file handles, since the underlying filesystem goes
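(The message is cut off above. On the SIGSTOP question: a stopped process keeps its open file descriptors, which can be checked with a quick sketch like the one below. It inspects /proc on Linux; `sleep` simply stands in for a job step.)

```shell
#!/bin/sh
# Sketch: SIGSTOP (what `scontrol suspend` sends) pauses a process
# without closing its open file descriptors.
sleep 30 &
pid=$!
kill -STOP "$pid"                 # analogous to `scontrol suspend`
sleep 1                           # give the kernel a moment to update state
state=$(awk '{print $3}' "/proc/$pid/stat")   # "T" means stopped
nfds=$(ls "/proc/$pid/fd" | wc -l)            # stdin/stdout/stderr still open
kill -CONT "$pid"                 # analogous to `scontrol resume`
kill "$pid" 2>/dev/null
echo "state=$state fds=$nfds"
```

Whether those descriptors remain *usable* during a Lustre failover is then a question for the filesystem client, not the signal mechanism.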