Following up to my previous response:
You could have your keepalived/maxscale/mariadb/slurmdbd setup on 2 servers. We
chose to break it out for maximum resiliency of backend resource types. You
have to have two database instances, each with its own storage space, and use
replication between them. I do not know of
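For reference, the keepalived piece in a setup like that is just a small VRRP block
that floats a VIP between the two frontend hosts; a minimal sketch (the interface
name, router id and VIP below are placeholders, not actual values):

# /etc/keepalived/keepalived.conf on the primary; the backup uses state BACKUP and a lower priority
vrrp_instance DBD_VIP {
    state MASTER
    interface eth0              # placeholder NIC name
    virtual_router_id 51        # must match on both keepalived nodes
    priority 100                # e.g. 90 on the backup
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24           # placeholder VIP that clients/maxscale point at
    }
}

Clients then talk to the VIP, and keepalived moves it when the primary drops.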
Hi;
Are you sure this is a job task completion issue? When the epilog script
fails, Slurm will set the node to the DRAIN state:
"If the Epilog fails (returns a non-zero exit code), this will result in
the node being set to a DRAIN state"
https://slurm.schedmd.com/prolog_epilog.html
You can test th
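For example (standard sinfo/scontrol usage; the node name is just a placeholder),
the reason slurmctld records should tell you whether the epilog is what drained
the node:

# list drained/draining nodes with the recorded reason
sinfo -R
# or look at one node directly
scontrol show node node002 | grep -E 'State|Reason'
# after fixing the epilog, return the node to service
scontrol update NodeName=node002 State=RESUME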
Hi;
I think you can use a Pacemaker cluster for a virtual slurmdbd server: a
virtual slurmdbd server that runs both the slurmdbd and mysql services on the
active slurmctld server. When the active slurmctld server dies, you can try
to start them on the passive one.
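A rough pcs sketch of that idea (assuming mariadb and slurmdbd are managed as
systemd units; the resource names, VIP and netmask are placeholders):

# sketch only -- resource/group names, VIP and netmask are placeholders
pcs resource create dbd_vip ocf:heartbeat:IPaddr2 ip=192.0.2.20 cidr_netmask=24 --group slurmdbd_ha
pcs resource create dbd_mariadb systemd:mariadb --group slurmdbd_ha
pcs resource create dbd_slurmdbd systemd:slurmdbd --group slurmdbd_ha
# the group keeps the VIP, mysql and slurmdbd together and fails them over as one unit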
Regards;
Ahmet M.
Peter,
I believe that the answer to your database question is that you don't run two
MySQL/MariaDB servers at the same time. The only way that I know of to
run MySQL/MariaDB in an active-active setup, which is what you appear to be
describing, is with replication. The other setup is to
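For the replication route, the server side is mostly a couple of my.cnf settings
per instance (the values below are placeholders), plus the usual replication user
and CHANGE MASTER step on the replica:

# primary my.cnf -- placeholder values
[mysqld]
server_id     = 1
log_bin       = /var/log/mysql/mariadb-bin
binlog_format = ROW

# replica my.cnf
[mysqld]
server_id = 2
read_only = ON
relay_log = /var/log/mysql/relay-bin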
Hi Folks,
Thanks for responses.
I probably didn't make my initial point totally clear, so I'm following up
with a clarification.
The NFS server is considered to be sufficiently highly available
("Designed for 99.% availability with redundant hot-swap components,
including controllers and I/O mo
Thanks for the input guys!
We don’t even use Lustre filesystems… and it doesn’t appear to be I/O.
I execute iostat on both the head node and the compute node when the job is in
CG status, and the %iowait value is 0.00 or 0.01
$ iostat
Linux 3.10.0-957.el7.x86_64 (node002) 07/22/2020 _x86_64_
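One note: plain iostat prints averages since boot, which can hide a short stall;
running something like this on the compute node while the job sits in CG gives
per-device numbers over time:

# extended per-device stats, 1-second interval, 5 samples
iostat -x 1 5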
After a complete shutdown and restart of all daemons, things have changed
somewhat
# scontrol show nodes | egrep '(^Node|Gres)'
NodeName=mlscgpu1 Arch=x86_64 CoresPerSocket=16
Gres=gpu:quadro_rtx_6000:10(S:0)
NodeName=mlscgpu2 Arch=x86_64 CoresPerSocket=16
Gres=gpu:quadro_rtx_6000:5(S:0)
I have two systems in my cluster with GPUs. Their setup in slurm.conf is:
GresTypes=gpu
NodeName=mlscgpu1 Gres=gpu:quadro_rtx_6000:10 CPUs=64 Boards=1
SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557
NodeName=mlscgpu2 Gres=gpu:quadro_rtx_6000:5 CPUs=64 Boards=1
SocketsP
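For completeness, a gres.conf along the same lines would normally sit next to
that; the device paths below are assumptions, not copied from the real nodes:

# gres.conf sketch -- device files are assumed
NodeName=mlscgpu1 Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia[0-9]
NodeName=mlscgpu2 Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia[0-4]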
Same here. Whenever we see rashes of "Kill task failed" errors, it is invariably
symptomatic of one of our Lustre filesystems acting up or being saturated.
-Paul Edmon-
On 7/22/2020 3:21 PM, Ryan Cox wrote:
Angelos,
I'm glad you mentioned UnkillableStepProgram. We meant to look at
that a while ago b
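For anyone else wanting to try it, UnkillableStepProgram is just a pair of
slurm.conf settings pointing at a site-provided script; a minimal sketch (the
script path and timeout below are arbitrary choices):

# slurm.conf sketch -- script path and timeout are arbitrary
UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh
UnkillableStepTimeout=180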