date:20200721

[slurm-users] lots of job failed due to node failure

2020-07-21 Thread 肖正刚

Hi all, We run Slurm 19.05 on a cluster of about 1k nodes. Recently, we found that lots of jobs failed due to node failure; checking slurmctld.log, we found that nodes are set to the down state and then resumed quickly. Some log info: [2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding [2020-07-20T00:22:27.486] e

Re: [slurm-users] Slurm MySQL database configuration

2020-07-21 Thread Chad Cropper

We just recently did the following, with great success so far: - 2 CentOS 7 servers set up with MariaDB 10.4 Community - 1 VirtualBox VM on NFS storage - 1 LXD container on Ceph storage - Goal was to not rely on the same server/storage for both SLURMDBD services - SLURMDBD service configu

Re: [slurm-users] Slurm MySQL database configuration

2020-07-21 Thread Brian Andrus

You need a single database connection string that both slurmdbd daemons point to. To get high availability, you should do that in the database configuration (HA, replication, etc.), which makes sense, as you want the database itself to be HA. The Slurm backup configuration is to protect from your slur

[slurm-users] Slurm MySQL database configuration

2020-07-21 Thread Peter Mayes

Hi, This is my first post to the list, so apologies if this is a FAQ. My configuration has two nodes allocated for Slurm masters, with a highly available NFS server mounting a filesystem across the two nodes. I need advice on the best configuration. I naively thought of having a single MariaDB databas