Re: [slurm-users] [External] Fwd: Slurm MySQL database configuration

2020-07-23 Thread Chad Cropper
Following up on my previous response: you could have your keepalived/maxscale/mariadb/slurmdbd setup on 2 servers. We chose to break it out for maximum resiliency of backend resource types. You do have to have 2 database instances with their own storage space and use replication. I do not know of…
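
A minimal keepalived sketch of the two-server failover described above; the interface name, router ID, password, and floating address are placeholders, not values from the original post:

    # /etc/keepalived/keepalived.conf on the primary server
    vrrp_instance DBD_VIP {
        state MASTER               # BACKUP on the second server
        interface eth0             # replace with the real NIC name
        virtual_router_id 51
        priority 100               # use a lower priority (e.g. 90) on the backup
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass changeme
        }
        virtual_ipaddress {
            192.168.10.100/24      # floating IP that slurmdbd clients point at
        }
    }

slurm.conf's AccountingStorageHost (or slurmdbd.conf's StorageHost, depending on which hop is being floated) would then reference the floating address rather than either physical server.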

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-23 Thread mercan
Hi; Are you sure this is a job task completion issue? When the epilog script fails, Slurm will set the node to a DRAIN state: "If the Epilog fails (returns a non-zero exit code), this will result in the node being set to a DRAIN state" https://slurm.schedmd.com/prolog_epilog.html You can test th…
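
As a minimal sketch of that behaviour, assuming slurm.conf contains something like Epilog=/etc/slurm/epilog.sh (the path is an example, not from the original post):

    #!/bin/bash
    # /etc/slurm/epilog.sh -- runs on the compute node after each job.
    # Any non-zero exit status drains the node, so log problems rather
    # than letting a stray failing command propagate its exit code.
    logger -t slurm-epilog "epilog for job ${SLURM_JOB_ID:-unknown} starting"

    # ... site-specific cleanup goes here ...

    # Exit 0 explicitly so the last command's status cannot drain the node.
    exit 0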

Re: [slurm-users] [External] Fwd: Slurm MySQL database configuration

2020-07-23 Thread mercan
Hi; I think you can use a Pacemaker cluster for a virtual slurmdbd server: a virtual server which runs both the slurmdbd and mysql services on the active slurmctld server. When the active slurmctld server dies, you can try to start them on the passive one. Regards; Ahmet M. On 23.07.2020 19:12…
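
A rough sketch of that idea using pcs; the resource names, the floating address, and the assumption that mariadb and slurmdbd are managed as systemd units are all illustrative:

    # Floating IP that AccountingStorageHost in slurm.conf can point at
    pcs resource create dbd_vip ocf:heartbeat:IPaddr2 ip=192.168.10.101 cidr_netmask=24

    # Manage the database and slurmdbd as ordinary systemd services
    pcs resource create mariadb_svc systemd:mariadb
    pcs resource create slurmdbd_svc systemd:slurmdbd

    # Keep the three together, started in this order on whichever node is active
    pcs resource group add slurmdbd_stack dbd_vip mariadb_svc slurmdbd_svc

Note that this only moves the services; the database files themselves still need shared or replicated storage underneath.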

Re: [slurm-users] [External] Fwd: Slurm MySQL database configuration

2020-07-23 Thread Michael Robbert
Peter, I believe that the answer to your database question is that you don't have two MySQL/MariaDB servers running at the same time. The only way that I know of to run MySQL/MariaDB in an active-active setup, which is what you appear to be describing, is with replication. The other setup is to…
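
For reference, a bare-bones sketch of the replication being referred to; server IDs, hostnames, and credentials are placeholders:

    # Primary's my.cnf, [mysqld] section:
    #   server-id = 1
    #   log_bin   = mysql-bin

    # On the replica, after creating a replication user on the primary:
    CHANGE MASTER TO
      MASTER_HOST='db1.example.com',
      MASTER_USER='repl',
      MASTER_PASSWORD='changeme',
      MASTER_LOG_FILE='mysql-bin.000001',
      MASTER_LOG_POS=4;
    START SLAVE;

slurmdbd itself would still talk to only one of the two instances at a time.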

[slurm-users] Fwd: Slurm MySQL database configuration

2020-07-23 Thread Peter Mayes
Hi Folks, thanks for the responses. I probably didn't make my initial point totally clear, so I'm following up with clarification. The NFS server is considered to be sufficiently highly available ("Designed for 99.% availability with redundant hot-swap components, including controllers and I/O mo…

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-23 Thread Ivan Kovanda
Thanks for the input, guys! We don’t even use Lustre filesystems… and it doesn’t appear to be I/O. I ran iostat on both the head node and the compute node while the job was in CG status, and the %iowait value was 0.00 or 0.01. $ iostat Linux 3.10.0-957.el7.x86_64 (node002) 07/22/2020 _x86_64_…
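
For anyone chasing the same symptom, a few of the checks discussed in this thread, sketched as commands (the node name is an example):

    # Jobs stuck in completing (CG) state
    squeue --states=CG -o "%i %u %N %M"

    # Why the node was drained (look at the Reason= field)
    scontrol show node node002 | grep -i reason

    # On the node itself: processes stuck in D (uninterruptible I/O) state,
    # which is what usually keeps a step from being killed
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'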

Re: [slurm-users] GPU configuration not working

2020-07-23 Thread Paul Raines
After a complete shutdown and restart of all daemons, things have changed somewhat: # scontrol show nodes | egrep '(^Node|Gres)' NodeName=mlscgpu1 Arch=x86_64 CoresPerSocket=16 Gres=gpu:quadro_rtx_6000:10(S:0) NodeName=mlscgpu2 Arch=x86_64 CoresPerSocket=16 Gres=gpu:quadro_rtx_6000:5(S:0)…
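
A couple of quick cross-checks for what the controller and the nodes agree on, assuming the node names above:

    # GRES per node as the controller sees it
    sinfo -N -o "%N %G"

    # Full per-node detail, including configured and allocated TRES
    scontrol show node mlscgpu1 | grep -iE 'Gres|TRES'

    # Run on the node itself: the hardware slurmd detects
    slurmd -C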

[slurm-users] GPU configuration not working

2020-07-23 Thread Paul Raines
I have two systems in my cluster with GPUs. Their setup in slurm.conf is: GresTypes=gpu NodeName=mlscgpu1 Gres=gpu:quadro_rtx_6000:10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557 NodeName=mlscgpu2 Gres=gpu:quadro_rtx_6000:5 CPUs=64 Boards=1 SocketsP…
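
A NodeName/Gres definition like this also needs a matching gres.conf on each node. A sketch, assuming the GPUs appear as /dev/nvidia0 upward (the device paths are assumptions, not from the original post):

    # /etc/slurm/gres.conf on mlscgpu1 (10 GPUs assumed)
    NodeName=mlscgpu1 Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia[0-9]

    # /etc/slurm/gres.conf on mlscgpu2 (5 GPUs assumed)
    NodeName=mlscgpu2 Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia[0-4]

If Slurm was built with NVML support, AutoDetect=nvml in gres.conf can discover the devices instead of listing them by hand.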

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-23 Thread Paul Edmon
Same here. Whenever we see rashes of "Kill task failed" it is invariably symptomatic of one of our Lustre filesystems acting up or being saturated. -Paul Edmon- On 7/22/2020 3:21 PM, Ryan Cox wrote: Angelos, I'm glad you mentioned UnkillableStepProgram. We meant to look at that a while ago b…
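
For anyone else who wants to try UnkillableStepProgram, a minimal sketch; the script path is an example, and the availability of SLURM_JOB_ID in its environment is an assumption (hence the fallback):

    # slurm.conf (example values)
    UnkillableStepTimeout=180
    UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh

    #!/bin/bash
    # /usr/local/sbin/unkillable_report.sh
    # Run when a step cannot be killed within the timeout; record the
    # stuck (D-state) processes so the cause can be traced afterwards.
    {
      echo "=== $(date) unkillable step on $(hostname), job ${SLURM_JOB_ID:-unknown} ==="
      ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
    } >> /var/log/slurm/unkillable.log 2>&1
    exit 0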