Hi all,
We run Slurm 19.05 on a cluster of about 1k nodes. Recently we found that
lots of jobs failed due to node failure; checking slurmctld.log, we found
that nodes are set to the DOWN state and then resumed quickly.
Some log info:
[2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
[2020-07-20T00:22:27.486] e
We just recently did the following with great success so far:
- 2 CentOS 7 servers set up with MariaDB 10.4 Community
- 1 VirtualBox VM on NFS storage
- 1 LXD container on Ceph storage
- Goal was to not rely on the same server/storage for both SLURMDBD services
- SLURMDBD service configu
You need a single database connection string that both slurmdbd daemons
point to.
For high availability, handle that on the database side (HA,
replication, etc.), which makes sense, since it is the database itself
that you want to be highly available.
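As a rough illustration of "both daemons point to the same database", here is a small Python sketch that compares the storage settings of two slurmdbd.conf files. The hostnames (dbd1, dbd2, db-vip), the database name, and the helper names are made up for the example; only the slurmdbd.conf parameter names are real. Treat it as a sanity check under those assumptions, not as an official tool.

```python
# Hypothetical sanity check: do two slurmdbd.conf files name the same database?
# Only the Storage* connection settings are compared; passwords are ignored.
STORAGE_KEYS = ("StorageType", "StorageHost", "StoragePort", "StorageLoc", "StorageUser")


def parse_conf(text):
    """Parse Key=Value lines, skipping blanks and '#' comments.

    Keys are lower-cased because Slurm parameter names are case-insensitive.
    Note: a literal '#' inside a value would be cut off by this simple parser.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def storage_mismatches(conf_a, conf_b):
    """Return the storage keys whose values differ between the two configs."""
    a, b = parse_conf(conf_a), parse_conf(conf_b)
    return [k for k in STORAGE_KEYS if a.get(k.lower()) != b.get(k.lower())]


# Example config text; every host and database name here is an assumption.
PRIMARY = """
DbdHost=dbd1
DbdBackupHost=dbd2
StorageType=accounting_storage/mysql
StorageHost=db-vip        # one address that fronts the HA database
StoragePort=3306
StorageLoc=slurm_acct_db
StorageUser=slurm
"""
BACKUP = PRIMARY  # the backup slurmdbd would normally carry the same file

if __name__ == "__main__":
    diff = storage_mismatches(PRIMARY, BACKUP)
    print(diff or "both daemons point at the same database")
```

If the two files disagree on any of these keys, the daemons are not talking to the same database, and that is the first thing to fix before worrying about failover.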
The slurm backup configuration is to protect from your slur
Hi,
My first post to the list, so apologies if this is a FAQ.
My configuration has two nodes allocated for Slurm masters, with a
highly-available NFS server mounting a filesystem across the two nodes.
I need advice on the best configuration.
I naively thought of having a single MariaDB databas